Optimization instead of inference

You know, I’ve always taken it for granted that what we want to do is probabilistic inference but lately I’ve been thinking more about what we really want and how to get there.

To illustrate my point, consider our dear friend LDA. For the uninitiated, let me write down the generative process for this Bayesian mixture model of discrete data:

For $k \in 1\ldots K$ , draw $\beta_k \sim \mathrm{Dir}(\eta)$ ;
For $d \in 1\ldots D$ :
1. Draw $\theta_d \sim \mathrm{Dir}(\alpha)$ ;
2. For $n \in 1\ldots N_d$ :
  1. Draw $z_{d,n} \sim \mathrm{Mult}(\theta_d)$ ;
  2. Draw $w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$ .

We only observe $w_{d,n}$ ; the rest of the variables are hidden from us. Now the story usually says that what we need to do is perform posterior inference, that is, determine the distribution over the hidden variables ( $z, \theta$ and sometimes $\beta$ ) given the observations. People have come up with a bunch of ways of doing this like MCMC and variational inference.

But in fact what we often care about is the mode. Just look at the way practitioners actually use these things. A typical MCMC experiment will run the updates for a hundred iterations or so, look at the final state of the parameters (under the assumption that we’ve found the mode) and then put a pretty picture of this final state in the paper.

If that’s what we actually want, why bother with the rest? Put another way, our goal is to maximize the joint likelihood of the data and the parameters we care about. One parameter we typically care about is $\theta_d$ ; one we typically don’t care about is $z_{d,n}$ .

With these facts guiding us, we can write out the optimization problem for a particular document as $\max_{\theta_d} \log p(\theta_d, w_d | \beta, \alpha) =$ $\max_{\theta_d} \log p(\theta_d | \alpha) + \sum_n \log p(w_{d,n} | \theta_d, \beta)$ . We can expand the probability for a single word by summing out its topic assignment: $p(w_{d,n} | \theta_d, \beta) = \sum_{z_{d,n}} p(w_{d,n} | z_{d,n}, \beta) p(z_{d,n} | \theta_d) = \theta_d^T \beta_{\cdot, w_{d,n}}$ . The portion of the prior term that depends on $\theta_d$ is $\log p(\theta_d | \alpha) = (\alpha - 1) \sum_k \log \theta_{d,k} + \mathrm{const}$ .

The objective function, in total then, is $(\alpha - 1) \sum_k \log \theta_{d,k} + \sum_n \log \theta_d^T \beta_{\cdot, w_{d,n}}$ . Now the solution is perhaps not entirely trivial but we can just plug this into a standard optimizer with adequate constraints ( $\theta_{d,k} \ge 0$ and $\sum_k \theta_{d,k} = 1$ ) to solve for the optimal $\theta_d$ .
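To make the per-document objective concrete, here is a minimal sketch of one way to maximize it. It is not the optimizer used for the experiment in this post: instead of a constrained solver, it swaps in plain gradient ascent on a softmax reparameterization of $\theta_d$, which keeps the estimate on the simplex automatically. The function names (`fit_theta`, `log_objective`), the fixed step count, and the step size of one over $N_d + K(\alpha - 1)$ are all assumptions of the sketch. It also assumes that quantity is positive.

```python
import math


def softmax(logits):
    """Map unconstrained logits to a point on the probability simplex."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def log_objective(theta, doc, beta, alpha):
    """(alpha - 1) * sum_k log theta_k + sum_n log(theta^T beta[:, w_n])."""
    prior = (alpha - 1.0) * sum(math.log(t) for t in theta)
    likelihood = sum(
        math.log(sum(t * topic[w] for t, topic in zip(theta, beta)))
        for w in doc
    )
    return prior + likelihood


def fit_theta(doc, beta, alpha, num_steps=500):
    """Gradient ascent on softmax logits (hypothetical helper, not LDA-C).

    doc is a list of word ids; beta[k][w] is p(word w | topic k).
    """
    num_topics = len(beta)
    scale = len(doc) + num_topics * (alpha - 1.0)
    if scale <= 0.0:
        raise ValueError("need len(doc) + K * (alpha - 1) > 0")
    logits = [0.0] * num_topics  # start from the uniform distribution
    for _ in range(num_steps):
        theta = softmax(logits)
        # Prior pseudo-counts plus each word's share of every topic.
        counts = [alpha - 1.0] * num_topics
        for w in doc:
            word_prob = sum(t * topic[w] for t, topic in zip(theta, beta))
            for k in range(num_topics):
                counts[k] += theta[k] * beta[k][w] / word_prob
        # The gradient w.r.t. logit k is counts[k] - scale * theta[k];
        # take a step of size 1 / scale along it.
        logits = [x + (c / scale - t)
                  for x, c, t in zip(logits, counts, theta)]
    return softmax(logits)
```

A toy check: with two topics that each emit a single distinct word (`beta = [[1.0, 0.0], [0.0, 1.0]]`), the document `[0, 0, 0, 1]`, and `alpha = 2.0`, the objective reduces to $4 \log \theta_{d,1} + 2 \log \theta_{d,2}$, whose maximizer is $(2/3, 1/3)$. Note that the toy uses $\alpha > 1$. For $\alpha < 1$ the prior term grows as a component approaches zero, so a real run would need some floor on $\theta_d$, which this sketch does not provide.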

So how well does it do? To test this I generated a synthetic data set using what I felt were fairly typical hyperparameters: $K=5, \eta = 0.1, \alpha=1 / K, D=1000$ and I set the vocabulary size (i.e., the length of each $\beta_k$ ) to 2000. Using these hyperparameters I followed the generative process to generate $D$ documents. In order to decide how many words should appear in each document, I drew $N_d \sim \mathrm{Poisson}(100)$ .
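For anyone who wants to reproduce a data set of this shape, the generative process with these hyperparameters can be sketched in a few lines of standard-library Python. This is an illustration rather than the script used for the post: the function names, the Gamma-normalization trick for Dirichlet draws, and Knuth's method for Poisson draws are all choices made for the sketch.

```python
import itertools
import math
import random


def sample_dirichlet(rng, concentration, size):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]


def sample_poisson(rng, rate):
    """Knuth's multiplication method; adequate for moderate rates like 100."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_corpus(rng, num_topics=5, vocab_size=2000, num_docs=1000,
                    eta=0.1, alpha=None, mean_length=100):
    """Follow the LDA generative process; returns (beta, thetas, docs)."""
    if alpha is None:
        alpha = 1.0 / num_topics  # the alpha = 1 / K setting from the post
    beta = [sample_dirichlet(rng, eta, vocab_size) for _ in range(num_topics)]
    # Cumulative weights let random.choices bisect instead of re-summing.
    cum_beta = [list(itertools.accumulate(topic)) for topic in beta]
    vocab = range(vocab_size)
    thetas, docs = [], []
    for _ in range(num_docs):
        theta = sample_dirichlet(rng, alpha, num_topics)
        length = sample_poisson(rng, mean_length)
        # One topic assignment z per word, then one word per assignment.
        topics = rng.choices(range(num_topics), weights=theta, k=length)
        words = [rng.choices(vocab, cum_weights=cum_beta[z])[0]
                 for z in topics]
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs
```

Calling `generate_corpus(random.Random(0))` uses the defaults above, which mirror the settings listed in this post. The seed is arbitrary.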

I then used the procedure above to estimate optimal values of $\theta_d$ , which I’ll denote $\hat{\theta}_{\mathrm{Opt}}$ . I also plugged this data set into LDA-C, which performs variational inference. LDA-C estimates the posterior distribution; from its estimates I compute the ~~mode~~ mean using $\hat{\theta}_{\mathrm{Var}} = \frac{\gamma_d}{\sum_k \gamma_{d,k}}$ , where $\gamma_d$ is the variational Dirichlet parameter that LDA-C fits for document $d$ . Let me emphasize that both methods are given the true value of the other parameters: $\beta, \alpha, K$ .

With these estimates of $\theta_d$ I compute their quality against the true value of $\theta_d$ using KL-divergence. I’ve plotted the difference of the quality of $\hat{\theta}_{\mathrm{Opt}}$ and $\hat{\theta}_{\mathrm{Var}}$ for each document as a function of the number of words in that document. The plot is here. The points below the zero-line (colored in red) are points where the optimization described earlier actually does a better job of recovering the true $\theta_d$ than LDA-C. The purple line (which is kind of hard to see since it’s near zero) represents the mean difference. Overall, LDA-C does slightly worse than we do.
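The comparison itself is only a few lines. The sketch below assumes the KL-divergence is taken from the true $\theta_d$ to each estimate, and that the plotted quantity is the optimizer's divergence minus LDA-C's, which matches the "below zero means the optimization wins" reading of the plot. Both are assumptions about direction and sign, and the function names are made up for illustration.

```python
import math


def dirichlet_mean(gamma):
    """Mean of a Dirichlet with parameter vector gamma."""
    total = sum(gamma)
    return [g / total for g in gamma]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def quality_gap(theta_true, theta_opt, gamma):
    """Negative when the optimizer's estimate is closer to the truth."""
    theta_var = dirichlet_mean(gamma)
    return (kl_divergence(theta_true, theta_opt)
            - kl_divergence(theta_true, theta_var))
```

Plotting `quality_gap` against the document length for every document would give a scatter of the kind described above.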

So what we’ve got here is a fairly fast and easy way of finding some kind of “mode” for LDA. Although the modes found by the two techniques aren’t really maximizing the same thing, they both tend to do equally well at recovering the true value of the parameters. So why not just optimize?

P.S. If anybody is actually reading this, and has tried doing optimization with constraints of this sort, we should trade war stories.

8 responses to “Optimization instead of inference”

István Szita

April 25, 2009 at 2:25 pm

Hi,

I just stumbled upon your blog (and added it to my bookmarks 🙂 ).

What you write here is really interesting, and resembles the cross-entropy method (a lengthy, but good introduction is here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.1072&rep=rep1&type=pdf ). It starts out as an importance sampling-like method for estimating probabilities of rare events. This problem is then transformed into an optimization problem.

I used CEM quite a lot as a black-box optimization method for all sorts of problems, and it worked really well so far.

Max Welling

April 26, 2009 at 10:22 pm

Hi Jonathan,

I believe that what you do is equivalent to Hofmann’s PLSA with a prior included.
PLSA is an EM-type algorithm for maximizing the likelihood of the model.

- slycoder
  
  April 27, 2009 at 6:23 am
  
  There’s definitely a tight connection here with pLSA. But as far as I know, the original EM algorithm for pLSA still computes expectations for each topic assignment in the E-step. Here we’re integrating those out and directly numerically estimating the per-document mixing proportions.
  
Austin Waters

April 27, 2009 at 5:55 am

Hi Jonathan,

You’re the newest addition to my google reader. 🙂

Above, I believe you’re actually computing the *mean* of posterior approximated by LDA-C, rather than the mode. (Or at least you wrote the formula for the mean, not the mode, in your post.) I wonder if this has any effect on the overall result?

- slycoder
  
  April 27, 2009 at 6:48 am
  
  Hi,
  Thanks for reading.
  
  Good catch; I’m indeed computing the mean, not the mode (post updated)! Computing the mode is actually a little bit problematic using the parameters gleaned from LDA-C — if any of the parameters of a Dirichlet distribution are less than one, then the mode is going to be at one of the corners of the simplex which is not really conducive for the comparison above.
  
  This suggests another subtle difference between the two techniques. At this late hour I don’t know how to do a “fair” comparison, but if you can concoct something, let me know.
  
winnerdy

December 7, 2013 at 1:54 pm

Hi Jonathan,

Very interesting insight! But I still have one question here. In your post, you’re only trying to estimate $\theta$ given $\beta$. In reality, we need to estimate both. Can your method be used to estimate both parameters? I know we need to add the prior of $\beta$ to the objective function. Then how should we optimize our objective and get the optimal values of $\theta$ and $\beta$? I’m sorry I cannot figure it out as I don’t have a strong background in optimization.

- slycoder
  
  December 10, 2013 at 12:36 am
  
  In principle you could estimate both. This post is pretty old; since then, the stochastic variational techniques of Hoffman et al., which have a similar flavor, suggest that this sort of joint optimization could very well be the best way to train things.