A friend of mine recently posed a problem which seems at first blush quite simple but turns out to be quite interesting. Suppose you have a program which takes time $T$ and which is perfectly parallelizable — that is, I can split the program up into $n$ tasks and the latency for the program will be $T/n$. So I can make the latency arbitrarily small by taking $n \to \infty$.
So why wouldn’t you ever just set $n$ as large as conceivably possible? Well, suppose there were a fixed startup cost associated with each task, $c$. Then the latency would be $T/n + c$. Again, it’s clear that you should just parallelize the hell out of your program and eventually your latency will approach $c$.
But what if the startup cost were random? Let the startup cost be drawn from some probability distribution $p$. Now each subtask will take a different amount of time; since I’m concerned with the latency, the metric I want to optimize is $\frac{T}{n} + \mathbb{E}\left[\max_{i=1,\dots,n} c_i\right]$ over $n$, where the $c_i$ are i.i.d. draws from $p$. Now there’s a cool tradeoff involved in parallelization: on the one hand, splitting a task reduces the $T/n$ part of the equation, but on the other, it increases the number of draws of $c$ and therefore the expected maximum value.
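To make the objective concrete before diving into the math, here’s a minimal Monte Carlo sketch of that latency (my own illustration, not part of the original problem statement; the distribution and values are just examples):

```python
import numpy as np

def expected_latency(T, n, sample_costs, trials=10_000, seed=0):
    """Estimate T/n + E[max of n startup costs] by Monte Carlo.

    sample_costs(rng, shape) should return i.i.d. draws from the
    startup-cost distribution with the requested shape.
    """
    rng = np.random.default_rng(seed)
    costs = sample_costs(rng, (trials, n))   # one row of n startup costs per trial
    return T / n + costs.max(axis=1).mean()  # compute part + average per-trial maximum

# Example: uniform startup costs on [0, 1] and a program that takes T = 0.25
for n in (1, 2, 4, 8):
    print(n, expected_latency(0.25, n, lambda rng, shape: rng.uniform(0, 1, shape)))
```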
It turns out that the solution to this problem leads down a huge rabbit hole involving order statistics, sample maxima, and extreme value statistics. Here are some highlights:
- The probability distribution of the maximum value in a sample of $n$ draws from a distribution $p(x)$ is given by $f_{\max}(x) = n \, F(x)^{n-1} \, p(x)$, where $F(x)$ is the cumulative distribution function (there’s a more general form for the $k$th order statistic).
- The expected value of the maximum (drawn according to the procedure above) is $\mathbb{E}[\max_i c_i] = \int_0^1 F^{-1}(u) \, n u^{n-1} \, du$, where $F^{-1}$ is the inverse CDF (or quantile) function. For almost all distributions this expression cannot be computed analytically (the uniform distribution being the notable exception).
- When $p$ is $\mathrm{Uniform}(0, b)$, the expected maximum has the analytical solution $\frac{n}{n+1} b$. This leads to a quadratic polynomial as the solution of the original problem. Without loss of generality, let $b = 1$ (i.e., measure $T$ in units of the largest possible startup cost). Then setting the derivative of $T/n + \frac{n}{n+1}$ to zero gives $(1 - T)\,n^2 - 2 T n - T = 0$, and the optimal value of $n$ takes the form $n^* = \frac{\sqrt{T}}{1 - \sqrt{T}}$. Note that when $T > 1$ the roots are negative, i.e., one should set $n$ to infinity. The interesting range is $0 < T < 1$. I’ve attached a plot of this function for this range here. You’ll notice that the bend is quite steep, so you’ll want the parallelization to be either very small or very large depending on how long your program takes relative to the startup costs. (A quick numerical check of this closed form appears after this list.)
- When $p$ is not uniform, things are more difficult. I’m not entirely sure what the right model is for the costs, but I think an exponential distribution might make a first cut. Using a numerical approximation, I computed latencies as a function of $n$ for various values of $T$, with the rate parameter set to 1 (the values of $T$ are in the title of each plot). Because of the long tails, one needs to be much more cautious before parallelizing: the expected maximum of $n$ unit-rate exponentials is the harmonic number $H_n = \sum_{k=1}^n 1/k$, which keeps growing like $\ln n$, so for the values of $T$ explored here you generally shouldn’t go above a fairly small $n$. Also, the difference in performance between different values of $n$ can be quite substantial, but the bang/buck is questionable. For example, at the larger values of $T$, if you spend 5 times as many resources you’ll go almost twice as fast. Is that a worthwhile investment? When $T$ is small you almost definitely don’t want to parallelize at all. (A sketch reproducing these curves follows the list.)
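As a sanity check on the uniform result above, here’s a small sketch (mine; the $T$ values are illustrative) comparing the closed-form optimum $n^* = \sqrt{T}/(1 - \sqrt{T})$ against a brute-force search over integer $n$:

```python
import numpy as np

def uniform_latency(T, n):
    # Exact objective for Uniform(0, 1) startup costs: E[max] = n / (n + 1).
    return T / n + n / (n + 1)

ns = np.arange(1, 201)
for T in (0.04, 0.25, 0.64, 0.9):
    closed_form = np.sqrt(T) / (1 - np.sqrt(T))
    best_integer = ns[np.argmin(uniform_latency(T, ns))]
    print(f"T={T}: closed-form n*={closed_form:.2f}, best integer n={best_integer}")
```

The integer optimum sits at one of the integers bracketing $n^*$, as expected.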
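For the exponential case, the expected maximum of $n$ unit-rate exponentials is exactly the harmonic number $H_n$, which gives a cheap way to reproduce the shape of those latency curves without sampling. A sketch (again mine; the $T$ values are illustrative, not the ones from the original plots):

```python
import numpy as np

def exponential_latency(T, ns):
    # For n i.i.d. Exp(1) startup costs, E[max] is the harmonic number H_n.
    H = np.cumsum(1.0 / np.arange(1, ns.max() + 1))
    return T / ns + H[ns - 1]

ns = np.arange(1, 101)
for T in (0.5, 2.0, 10.0):
    latencies = exponential_latency(T, ns)
    best = ns[np.argmin(latencies)]
    print(f"T={T}: best n={best}, latency {latencies.min():.2f} "
          f"(vs {latencies[0]:.2f} with no parallelism)")
```

With the $\ln n$ growth of $H_n$, the minimizer lands near $n \approx T$, which is why small $T$ means no parallelization at all.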
Ok, so it seemed like a simple little problem at first, but it turns out that something as straightforward-sounding as parallelization can lead to some pretty interesting analysis.