I once heard it on good authority that Gelman says you usually don’t need more than 12 samples. Well, as a result of a discussion with Sam Gershman (sorry Sam for not answering the actual question you asked!), I wondered if that was true; that is, if under reasonable assumptions it might be better to take a small number of samples. Caveat: there’s probably lots of work on this already, but where would the fun be in that?
Ok, let’s assume that your goal is to estimate $\mathbb{E}_{p(z)}[f(z)]$, where $p(z)$ represents some distribution on hidden variables over which you are trying to compute a function, $f$. For the usual reasons, it’s intractable to compute this exactly, so you’re going to use a sampler. Let’s assume
- that your sampler has mixed and that you’re getting independent samples (that condition alone should give you fair warning that what I’m about to say is of little practical value);
- that $f$ is bounded (say between 0 and 1);
- that obtaining $n$ samples from the sampler costs some amount, say $c(n)$.
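To make the setup concrete, here is a minimal sketch of the estimator (the particular distribution, function, and cost below are placeholders of my own choosing, not anything specified above):

```python
import numpy as np

def monte_carlo_estimate(sampler, f, n):
    """Estimate E_p[f(z)] from n independent samples."""
    samples = sampler(n)
    return np.mean(f(samples))

# Placeholder choices for illustration only.
rng = np.random.default_rng(0)
sampler = lambda n: rng.integers(0, 2, size=n)  # z ~ Bernoulli(0.5)
f = lambda z: z                                 # f(z) = z, bounded in [0, 1]
cost = lambda n: 0.01 * n                       # linear cost c(n) = 0.01 n

print(monte_carlo_estimate(sampler, f, 12))     # estimate from 12 samples
```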
More samples are usually better, because they’ll give you a better representation of the true distribution, i.e. $\hat{p}_n \to p$ as $n \to \infty$, where $\hat{p}_n$ is the empirical distribution obtained by using $n$ samples. Unfortunately, more samples come at a cost here, so you don’t want too many. How should you trade off, then?
We can define a loss by $L(n) = \left|\mathbb{E}_{p}[f(z)] - \mathbb{E}_{\hat{p}_n}[f(z)]\right| + c(n)$, that is, how far off our sampled estimate is from the truth, plus the cost of obtaining those samples. Using Hoeffding, we can bound the loss by $L(n) \le \sqrt{\frac{\log(2/\delta)}{2n}} + c(n)$ with probability at least $1-\delta$. This expression gives you something to think about when deciding how many samples to take: more samples tighten the deviation term but increase the cost term, and asking for a higher-probability guarantee (smaller $\delta$) loosens the bound.
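As a quick sanity check, here is a small sketch (mine, not from the original post) that evaluates this bound; the cost function and values of $\delta$ and $n$ are placeholders:

```python
import numpy as np

def loss_bound(n, delta, cost):
    """Hoeffding deviation term plus sampling cost.

    Holds with probability at least 1 - delta when f is bounded in [0, 1]
    and the n samples are independent.
    """
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n)) + cost(n)

delta = 0.2                    # an 80% guarantee
cost = lambda n: 0.01 * n      # placeholder linear cost
for n in [1, 5, 12, 50, 200]:
    print(n, loss_bound(n, delta, cost))
```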
If your cost is linear, $c(n) = cn$, you might want to choose $n$ something like $n = \left(\frac{\log(2/\delta)}{8c^2}\right)^{1/3}$ (the minimizer of the bound above), which gives you a loss of roughly $\frac{3}{2}\left(c\,\log(2/\delta)\right)^{1/3}$ with probability $1-\delta$.
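A tiny check of that choice (again just a sketch under the linear-cost assumption, with placeholder values of $c$ and $\delta$), comparing the closed-form $n$ against a brute-force search over the bound:

```python
import numpy as np

def bound(n, delta, c):
    """Hoeffding deviation term plus linear cost c * n."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n)) + c * n

delta, c = 0.2, 0.01
n_star = (np.log(2.0 / delta) / (8.0 * c ** 2)) ** (1.0 / 3.0)   # closed form
loss_star = 1.5 * (c * np.log(2.0 / delta)) ** (1.0 / 3.0)       # its loss bound

ns = np.arange(1, 500)
n_brute = ns[np.argmin(bound(ns, delta, c))]                     # brute-force minimizer

print(n_star, loss_star, n_brute)   # the two n values should roughly agree
```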
The plot below shows what might happen if you make such a choice. Here, I’ve let the posterior be an equiprobable binomial distribution. The function I’m computing is the identity, $f(z) = z$. The curves show the loss, $\left|\mathbb{E}_{p}[f] - \mathbb{E}_{\hat{p}_n}[f]\right| + cn$, for various choices of the cost parameter $c$ as a function of the number of samples. The dots show the chosen values of $n$ for each value of $c$; the horizontal lines show the 80% loss bound for these choices.
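I don’t have the original plotting code, but a rough reconstruction of the experiment might look like the sketch below. My assumptions: a Bernoulli(0.5) draw stands in for the equiprobable binomial, the empirical loss is averaged over repeated runs, $\delta = 0.2$ gives the 80% bound, and the cost parameters are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
delta = 0.2                      # 80% bound
true_mean = 0.5                  # E[f(z)] for f(z) = z, z ~ Bernoulli(0.5)
ns = np.arange(1, 101)

for c in [0.005, 0.01, 0.02]:    # placeholder cost parameters
    # Empirical loss |E_p[f] - E_phat[f]| + c*n, averaged over repeats.
    loss = [np.mean([abs(rng.integers(0, 2, size=n).mean() - true_mean)
                     for _ in range(500)]) + c * n for n in ns]
    plt.plot(ns, loss, label=f"c = {c}")

    # Chosen n (minimizer of the Hoeffding bound) and its 80% loss bound.
    n_star = (np.log(2.0 / delta) / (8.0 * c ** 2)) ** (1.0 / 3.0)
    bound_star = 1.5 * (c * np.log(2.0 / delta)) ** (1.0 / 3.0)
    plt.plot(n_star, bound_star, "o")
    plt.axhline(bound_star, linestyle=":")

plt.xlabel("number of samples n")
plt.ylabel("loss")
plt.legend()
plt.show()
```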
Turns out for some reasonable values, you really should stick to about 12 samples.
I like the accuracy vs. time dimension. It fits in with Monsieur Gelman’s continual blog comments about sticking to justified significant digits. Especially when you look at some of his hierarchical models that have very few items from which to estimate means.
This reminds me, for some reason, of SGD vs. quasi-Newton methods like L-BFGS for fitting regression models. SGD’s way faster at getting to a couple significant digits, but slower at getting to lots of significant digits.
Yeah, a lot of statisticians love lots of significant digits which is odd since they should be the first to recognize that many of those digits are useless.
As for SGD vs. L-BFGS, in my experience both SGD and L-BFGS are terrible for getting to lots of significant digits =).