A friend of mine recently posed a problem which seems at first blush quite simple but turns out to be quite interesting. Suppose you have a program which takes time $T$ and which is perfectly parallelizable — that is, I can split the program up into $k$ tasks and the latency for the program will be $T/k$.
So I can make the latency arbitrarily small by taking $k \to \infty$.
So why wouldn’t you ever just set $k$ as large as conceivably possible? Well, suppose there were a fixed startup cost $c$ associated with each task. Then the latency would be $T/k + c$. Again, it’s clear that you should just parallelize the hell out of your program and eventually your latency will approach $c$.
But what if the startup cost were random? Let the startup cost be drawn from some probability distribution $F$. Now each subtask will take a different amount of time; since I’m concerned with the latency, the metric I want to optimize is

$$L(k) = \frac{T}{k} + \mathbb{E}\Big[\max_{1 \le i \le k} c_i\Big]$$

over $k$, where $c_1, \dots, c_k$ are drawn i.i.d. from $F$. Now there’s a cool tradeoff involved in parallelization: on the one hand, splitting a task reduces the $T/k$ part of the equation, but on the other, it increases the number of draws of $c_i$ and therefore the expected maximum value.
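To make the tradeoff concrete, here’s a minimal Monte Carlo sketch of the latency objective above (the function names and the choice of a uniform cost distribution are mine, purely for illustration):

```python
# Minimal Monte Carlo sketch of L(k) = T/k + E[max of k startup costs].
import numpy as np

rng = np.random.default_rng(0)

def expected_latency(T, k, sample_costs, n_trials=10_000):
    """Estimate T/k + E[max of k startup-cost draws] by simulation."""
    costs = sample_costs((n_trials, k))          # one row of k costs per trial
    return T / k + costs.max(axis=1).mean()      # mean of the row maxima

# Example: Uniform(0, 1) startup costs, total work T = 0.5.
for k in (1, 2, 4, 8, 16):
    L = expected_latency(0.5, k, lambda size: rng.uniform(0.0, 1.0, size))
    print(f"k={k:2d}: latency ≈ {L:.3f}")
```

Even in a toy run like this you can see the two forces at work: $T/k$ falls while the expected maximum creeps up.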
It turns out that the solution to this problem leads down a huge rabbit hole involving order statistics, sample maxima, and extreme value statistics. Here are some highlights:
- The probability distribution of the maximum value in a sample of $n$ draws from a distribution $F$ is given by $\Pr\big[\max_i X_i \le x\big] = F(x)^n$, where $F$ is the cumulative distribution function (there’s a more general form for the $j$th order statistic).
- The expected value of the maximum (drawn according to the procedure above) is $\mathbb{E}\big[\max_i X_i\big] = n \int_0^1 u^{n-1} F^{-1}(u)\,du$, where $F^{-1}$ is the inverse CDF (or quantile) function. For almost all distributions this expression cannot be computed analytically (the uniform distribution being the notable exception).
- When $F$ is $\mathrm{Uniform}(0,1)$, the analytical solution to the expected maximum value is $\frac{n}{n+1}$. This leads to a quadratic polynomial as the solution of the original problem: setting the derivative of $T/k + \frac{k}{k+1}$ to zero gives $(1-T)k^2 - 2Tk - T = 0$. Without loss of generality, let the costs be $\mathrm{Uniform}(0,1)$, i.e., measure $T$ in units of the largest possible startup cost. Then the optimal value of $k$ takes the form $k^\ast = \frac{\sqrt{T}}{1-\sqrt{T}}$ (checked numerically in the sketch after this list). Note that when $T > 1$ the roots are negative, i.e., one should set $k$ to infinity. The interesting range of $T$ is between 0 and 1. I’ve attached a plot of this function for this range here. You’ll notice that the bend is quite steep, so you’ll want the parallelization to be either very small or very large depending on how long your program takes relative to the startup costs.
- When $F$ is not uniform, things are more difficult. I’m not entirely sure what the right model is for the costs, but I think an exponential distribution might make a first cut. Using a numerical approximation, I computed latencies as a function of $k$ for various values of $T$, with the rate parameter set to 1 (the values of $T$ are in the title of each plot; see also the sketch after this list). Because of the long tails, one needs to be much more cautious before parallelizing. For the values of $T$ explored here you generally shouldn’t go above a fairly small $k$. Also, the difference in performance across values of $k$ can be quite substantial, but the bang for the buck is questionable. For example, if you spend 5 times as many resources, you’ll go almost twice as fast. Is that a worthwhile investment? When $T$ is small relative to the mean startup cost you almost definitely don’t want to parallelize at all.
Ok, so it seemed like a simple little problem at first, but it turns out that something as straightforward-sounding as parallelization can lead to some pretty interesting analysis.
Another application of the result you just mentioned (that $\min(X_1,\dots,X_k)$ is concentrated around $1/k$ when the $X_i$ are (pairwise?) independently drawn from the uniform distribution) is finding the number of distinct elements in a massive stream of data of size $m$ consisting of elements from a massive set $\{1,\dots,n\}$, where you are only allowed $O(\log m + \log n)$ storage.
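For the curious, here’s a rough Python sketch of that min-based estimator; the hashing scheme and the averaging over independent hash functions are illustrative choices of mine, not the exact algorithm from the paper:

```python
# Rough sketch: hash each element to a pseudo-uniform value in [0, 1);
# with d distinct elements the minimum hash concentrates around 1/(d+1),
# so 1/min - 1 estimates d. Only the running minima need to be stored.
import hashlib

def estimate_distinct(stream, n_hashes=32):
    mins = [1.0] * n_hashes
    for x in stream:
        for i in range(n_hashes):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform
            mins[i] = min(mins[i], u)
    m = sum(mins) / n_hashes    # average the minima to reduce variance
    return 1.0 / m - 1.0

# 10,000 stream elements, 1,000 distinct values:
print(estimate_distinct(i % 1000 for i in range(10_000)))
```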
I am interested in knowing how to perform this type of analysis. I have a math minor focused more on graph theory and linear algebra. Can you point me to some good statistics resources for this kind of thing?
You can find the proofs in a paper at
http://portal.acm.org/citation.cfm?id=237823
Some of the math used can be found in the following great book:
The Probabilistic Method – Noga Alon and Joel Spencer
You could take a look at the relevant chapters of a text on randomized algorithms (a standard is “Randomized Algorithms” by Motwani and Raghavan). These are discrete math and algorithms references rather than statistics ones, but they contain everything you need to understand the proofs in the paper.
You assume that you start only a single task of each type. If startup costs are independent for several copies of the same task, then it makes sense to start a very large number of instances of the same task.
You’re totally right; if we replicate each task $m$ times and take the min over replicates, the latency becomes $\frac{T}{k} + \mathbb{E}\big[\max_{1 \le i \le k} \min_{1 \le j \le m} c_{ij}\big]$. Then it clearly makes sense to make $m$ as large as possible. Although, in cases when you also want to make $k$ as large as possible, I wonder what you want the ratio of $m$ to $k$ to be, that is, how you should divide resources between tasks and task replicates.
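Here’s a quick simulation of that question, under the (assumed) model of a fixed budget of $N = km$ workers and exponential startup costs:

```python
# Monte Carlo sketch: fixed budget N = k*m workers, k tasks each
# replicated m times; a task starts when its fastest replicate does,
# and the job finishes when its slowest task does.
import numpy as np

rng = np.random.default_rng(0)

def latency(T, k, m, n_trials=5_000):
    c = rng.exponential(1.0, size=(n_trials, k, m))   # startup costs
    return T / k + c.min(axis=2).max(axis=1).mean()

T, N = 4.0, 64
for k in (1, 2, 4, 8, 16, 32, 64):
    m = N // k
    print(f"k={k:2d}, m={m:2d}: latency ≈ {latency(T, k, m):.3f}")
```

In this toy run the sweet spot sits somewhere in the middle: replication tames the tail of the startup costs, while splitting eats into $T$.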