Daily Archives: March 16, 2009

Optimization instead of inference

You know, I’ve always taken it for granted that what we want to do is probabilistic inference but lately I’ve been thinking more about what we really want and how to get there.

To illustrate my point, consider our dear friend LDA. For the uninitiated, let me write down the generative process for this Bayesian mixture model of discrete data:

  1. For k \in 1\ldots K, draw \beta_k \sim \mathrm{Dir}(\eta);
  2. For d \in 1\ldots D:
    1. Draw \theta_d \sim \mathrm{Dir}(\alpha);
    2. For n \in 1\ldots N_d:
      1. Draw z_{d,n} \sim \mathrm{Mult}(\theta_d);
      2. Draw w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}}).

We only observe w_{d,n}; the rest of the variables are hidden from us. Now the story usually says that what we need to do is perform posterior inference, that is, determine the distribution over the hidden variables (z, \theta and sometimes \beta) given the observations. People have come up with a bunch of ways of doing this like MCMC and variational inference.

But in fact what we often care about is the mode. Just look at the way practitioners actually use these things. A typical MCMC experiment will run the updates for a hundred iterations or so, look at the final state of the parameters (under the assumption that we’ve found the mode) and then put a pretty picture of this final state in the paper.

If that’s what we actually want, why bother with the rest? Put another way, our goal is to maximize the joint likelihood of the data and the parameters we care about. One parameter we typically care about is \theta_d; however we typically don’t care about is z_{d,n}.

With these facts guiding us, we can write out the optimization problem for a particular document as \max_{\theta_d} p(\theta_d, w_d | \beta, \alpha) = \max_{\theta_d} p(\theta_d | \alpha) + \sum_n \log p(w_{d,n} | \theta_d, \beta). We can expand the probability for a single word p(w_{d,n} | \theta_d, \beta) = \sum_{z_{d,n}} p(w_{d,n} | z_{d,n}, \beta) p(z_{d,n} | \theta_d) = \theta^T \beta_{\cdot, w_{d,n}}. The portions of the prior term which are relevant are \log p(\theta_d | \alpha) = (\alpha - 1) \log \theta_d.

The objective function, in total then, is (\alpha - 1) \log \theta_d + \sum_n \log \theta^T \beta_{\cdot, w_{d,n}}. Now the solution is perhaps not entirely trivial but we can just plug this into a standard optimizer with adequate constraints to solve for the optimal \theta_d.

So how well does it do? To test this I generated a synthetic data set using what I felt were fairly typical hyperparameters: K=5, \eta = 0.1, \alpha=1 / K, D=1000 and I set the vocabulary size (i.e., the length of each \beta_k to 2000. Using these hyperparameters I followed the generative process to generate D documents. In order to decide how many words should appear in each document, I drew N_d \sim  \mathrm{Poisson}(100).

I then used the procedure above to estimate optimal values of \theta_d, which I’ll denote \hat{\theta}_{\mathrm{Opt}}. I also plugged this data set into LDA-C which performs variational inference. LDA-C estimates the posterior distribution; from its estimates I compute the mode mean using \hat{\theta}_{\mathrm{Var}} = \frac{\gamma_d}{\sum_k \gamma_{d,k}}. Let me emphasize that both methods are given the true value of the other parameters: \beta, \alpha, K.

With these estimates of \theta_d I compute their quality against the true value of \theta_d using KL-divergence. I’ve plotted the difference of the quality of \hat{\theta}_{\mathrm{Opt}} and \hat{\theta}_{\mathrm{Var}} for each document as a function of the number of words in that document. The plot is here. The points below the zero-line (colored in red) are points where the optimization described earlier actually does a better job of recovering the true \theta_d than LDA-C. The purple line (which is kind of hard to see since it’s near zero) represents the mean difference. Overall, LDA-C does slightly worse than we do.

So what we’ve got here is a fairly fast and easy way of finding some kind of “mode” for LDA. Although the modes found by the two techniques aren’t really maximizing the same thing, they both tend to do equally well at recovering the true value of the parameters. So why not just optimize?

P.S. If anybody is actually reading this, and has tried doing optimization with constraints of this sort, we should trade war stories.



Filed under Uncategorized