A friend of mine recently posed a problem which seems at first blush quite simple but turns out to be quite interesting. Suppose you have a program which takes time and which is perfectly parallelizable — that is, I can split the program up into tasks and the latency for the program will be .
So I can make the latency arbitrarily small by taking .
So why wouldn’t you ever just set as large as conceivably possible? Well, suppose there were a fixed startup cost associated with each task, . Then, the latency would be . Again, it’s clear that you should just parallelize the hell out of your program and eventually your latency will approach .
But what if the startup cost were random? Let the startup cost be drawn from some probability distribution . Now each subtask will take a different amount of time; since I’m concerned with the latency, the metric I want to optimize is over .
Now there’s a cool tradeoff involved in parallelization: one the one hand, splitting a task reduces the part of the equation, but on the other, it increases the number of draws of and therefore the expected maximum value.
It turns out that the solution to this problem leads down a huge rabbit hole involving order statistics, sample maxima, and extreme value statistics. Here are some highlights:
- The probability distribution of the maximum value in a sample of draws from a distribution is given by where is the cumulative distribution function (There’s a more general form for the th order statistic).
- The expected value of the maximum (drawn according to the procedure above) is , where is the inverse cdf (or quantile) function. For almost all distributions this expression cannot be computed analytically (The uniform distribution being the notable exception).
- When , the analytical solution to the expected maximum value is . This leads to a quadratic polynomial as the solution of the original problem. Without loss of generality, let . Then the optimal value of takes the form . Note that when the roots are negative, i.e., one should set to infinity. The interesting range is between 0 and 1. I’ve attached a plot of this function for this range here. You’ll notice that the bend is quiet steep so that you’ll want the parallelization to either be very small or very large depending on how long your program takes relative to the startup costs.
- When is not uniform, things are more difficult. I’m not entirely sure what the right model is for the costs, but I think and exponential distribution might make a first cut. Using a numerical approximation, I computed latencies as a function of for various values of and rate parameter set to 1 here (the values of are in the title of each plot). Because of the long tails, one needs to be much more cautious before parallelizing. For values of explored here you generally shouldn’t to go above . Also, the different performance for different values of can be quite substantial, but bang/buck is questionable. For example, when if you spend 5 times as many resources, you’ll go almost twice as fast. Is that a worthwhile investment? When you almost definitely don’t want to parallelize at all.
Ok, so it seemed like a simple little problem at first, but it turns out that something as straightforward sounding as parallelization can lead to some pretty interesting analysis.