Randomness makes parallelization interesting

A friend of mine recently posed a problem which seems at first blush quite simple but turns out to be quite interesting. Suppose you have a program which takes T time and which is perfectly parallelizable — that is, I can split the program up into K tasks and the latency for the program will be T / K.
So I can make the latency arbitrarily small by taking K \rightarrow \infty.

So why wouldn’t you ever just set K as large as conceivably possible? Well, suppose there were a fixed startup cost associated with each task, S. Then, the latency would be S + T/K. Again, it’s clear that you should just parallelize the hell out of your program and eventually your latency will approach S.

But what if the startup cost were random? Let the startup cost be drawn from some probability distribution S_i \sim f. Now each subtask will take a different amount of time; since I’m concerned with the latency, the metric I want to optimize is \max_{i\in \{1\ldots K\}} S_i + T/K over K.

Now there’s a cool tradeoff involved in parallelization: one the one hand, splitting a task reduces the T/K part of the equation, but on the other, it increases the number of draws of S_i and therefore the expected maximum value.

It turns out that the solution to this problem leads down a huge rabbit hole involving order statistics, sample maxima, and extreme value statistics. Here are some highlights:

  • The probability distribution of the maximum value in a sample of K draws from a distribution f is given by K F(x)^{K-1} f(x) where F is the cumulative distribution function (There’s a more general form for the ith order statistic).
  • The expected value of the maximum (drawn according to the procedure above) is \int_{0}^1 F^{-1}(x^{1/K}) dx, where F^{-1} is the inverse cdf (or quantile) function. For almost all distributions this expression cannot be computed analytically (The uniform distribution being the notable exception).
  • When f = \mathrm{Unif}(a, b), the analytical solution to the expected maximum value is a + (b - a) \frac{K}{K+1}. This leads to a quadratic polynomial as the solution of the original problem. Without loss of generality, let b-a = 1. Then the optimal value of K takes the form \frac{T + \sqrt{T}}{1-T}. Note that when T > 1 the roots are negative, i.e., one should set K to infinity. The interesting range is between 0 and 1. I’ve attached a plot of this function for this range here. You’ll notice that the bend is quiet steep so that you’ll want the parallelization to either be very small or very large depending on how long your program takes relative to the startup costs.
  • When f is not uniform, things are more difficult. I’m not entirely sure what the right model is for the costs, but I think and exponential distribution might make a first cut. Using a numerical approximation, I computed latencies as a function of K for various values of T and rate parameter set to 1 here (the values of T are in the title of each plot). Because of the long tails, one needs to be much more cautious before parallelizing. For values of T explored here you generally shouldn’t to go above K = 5. Also, the different performance for different values of K can be quite substantial, but bang/buck is questionable. For example, when T=5 if you spend 5 times as many resources, you’ll go almost twice as fast. Is that a worthwhile investment? When T=1 you almost definitely don’t want to parallelize at all.

Ok, so it seemed like a simple little problem at first, but it turns out that something as straightforward sounding as parallelization can lead to some pretty interesting analysis.

Advertisements

5 Comments

Filed under Uncategorized

5 responses to “Randomness makes parallelization interesting

  1. Indraneel

    Another application of the result you just mentioned( min(X1,..,Xk) is concentrated around 1/k when Xi are (pairwise?) independently drawn from the uniform distribution) is finding the number of distinct elements in a massive stream of data of size m consisting of elements from a massive set {1,…,n}, where you are only allowed O(log m + log n) storage.

    • Kevin Nuckolls

      I am interested in knowing how to perform this type of analysis. I have a math minor focused more on graph theory and linear algebra. Can you point me to some good statistics resources for this kind of thing?

      • Indraneel

        You can find the proofs in a paper at

        http://portal.acm.org/citation.cfm?id=237823

        Some of the math used can be found in the following great book:

        The Probabilistic Method – Noga Alon and Joel Spencer

        You could take a look at relevant chapters a text on randomized algorithm (a standard is “Randomized Algorithms” by Motwani and Raghawan). These are more discrete math and algorithms rather than statistics references, but they contain everything you need to understand the proofs in the paper.

  2. J Kujala

    You assume that you start only a single task of each type. If startup costs are independent for several copies of the same task, then it makes sense to start a very large number of instances of the same task.

    • You’re totally right; if we replicate each task we can take the min, that is \max_{i \in \{1\ldots K\}} \min_{j \in \{1 \ldots R\}} S_{ij} + T / K. Then it clearly makes sense to make R as large as possible. Although, in cases when you also want to make K as large as possible, I wonder what you want R / K to be, that is, how you should divide resources between tasks and task replicates.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s