Optimization instead of inference

You know, I’ve always taken it for granted that what we want to do is probabilistic inference but lately I’ve been thinking more about what we really want and how to get there.

To illustrate my point, consider our dear friend LDA. For the uninitiated, let me write down the generative process for this Bayesian mixture model of discrete data:

  1. For k \in 1\ldots K, draw \beta_k \sim \mathrm{Dir}(\eta);
  2. For d \in 1\ldots D:
    1. Draw \theta_d \sim \mathrm{Dir}(\alpha);
    2. For n \in 1\ldots N_d:
      1. Draw z_{d,n} \sim \mathrm{Mult}(\theta_d);
      2. Draw w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}}).

We only observe w_{d,n}; the rest of the variables are hidden from us. Now the story usually says that what we need to do is perform posterior inference, that is, determine the distribution over the hidden variables (z, \theta and sometimes \beta) given the observations. People have come up with a bunch of ways of doing this like MCMC and variational inference.

But in fact what we often care about is the mode. Just look at the way practitioners actually use these things. A typical MCMC experiment will run the updates for a hundred iterations or so, look at the final state of the parameters (under the assumption that we’ve found the mode) and then put a pretty picture of this final state in the paper.

If that’s what we actually want, why bother with the rest? Put another way, our goal is to maximize the joint likelihood of the data and the parameters we care about. One parameter we typically care about is \theta_d; however we typically don’t care about is z_{d,n}.

With these facts guiding us, we can write out the optimization problem for a particular document as \max_{\theta_d} p(\theta_d, w_d | \beta, \alpha) = \max_{\theta_d} p(\theta_d | \alpha) + \sum_n \log p(w_{d,n} | \theta_d, \beta). We can expand the probability for a single word p(w_{d,n} | \theta_d, \beta) = \sum_{z_{d,n}} p(w_{d,n} | z_{d,n}, \beta) p(z_{d,n} | \theta_d) = \theta^T \beta_{\cdot, w_{d,n}}. The portions of the prior term which are relevant are \log p(\theta_d | \alpha) = (\alpha - 1) \log \theta_d.

The objective function, in total then, is (\alpha - 1) \log \theta_d + \sum_n \log \theta^T \beta_{\cdot, w_{d,n}}. Now the solution is perhaps not entirely trivial but we can just plug this into a standard optimizer with adequate constraints to solve for the optimal \theta_d.

So how well does it do? To test this I generated a synthetic data set using what I felt were fairly typical hyperparameters: K=5, \eta = 0.1, \alpha=1 / K, D=1000 and I set the vocabulary size (i.e., the length of each \beta_k to 2000. Using these hyperparameters I followed the generative process to generate D documents. In order to decide how many words should appear in each document, I drew N_d \sim  \mathrm{Poisson}(100).

I then used the procedure above to estimate optimal values of \theta_d, which I’ll denote \hat{\theta}_{\mathrm{Opt}}. I also plugged this data set into LDA-C which performs variational inference. LDA-C estimates the posterior distribution; from its estimates I compute the mode mean using \hat{\theta}_{\mathrm{Var}} = \frac{\gamma_d}{\sum_k \gamma_{d,k}}. Let me emphasize that both methods are given the true value of the other parameters: \beta, \alpha, K.

With these estimates of \theta_d I compute their quality against the true value of \theta_d using KL-divergence. I’ve plotted the difference of the quality of \hat{\theta}_{\mathrm{Opt}} and \hat{\theta}_{\mathrm{Var}} for each document as a function of the number of words in that document. The plot is here. The points below the zero-line (colored in red) are points where the optimization described earlier actually does a better job of recovering the true \theta_d than LDA-C. The purple line (which is kind of hard to see since it’s near zero) represents the mean difference. Overall, LDA-C does slightly worse than we do.

So what we’ve got here is a fairly fast and easy way of finding some kind of “mode” for LDA. Although the modes found by the two techniques aren’t really maximizing the same thing, they both tend to do equally well at recovering the true value of the parameters. So why not just optimize?

P.S. If anybody is actually reading this, and has tried doing optimization with constraints of this sort, we should trade war stories.

Advertisements

8 Comments

Filed under Uncategorized

8 responses to “Optimization instead of inference

  1. Hi,

    I just stumbled upon your blog (and added it to my bookmarks 🙂

    What you write here is really interesting, and resembles the cross-entropy method (a lengthy, but good introduction is here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.1072&rep=rep1&type=pdf ). It starts out as an importance sampling-like method for estimating probabilities of rare events. this problem is then transformed to an optimization problem.

    I used CEM quite a lot as a black-box optimization method for all sorts of problems, and it worked really well so far.

  2. Max Welling

    Hi Jonathan,

    I believe that what you do is equivalent to Hoffman’s PLSA where you include a prior.
    PLSA is a EM type algorithm to maximize the ML of the model to describe.

    • There’s definitely a tight connection here with pLSA. But as far as I know, the original EM algorithm for pLSA still computes expectations for each topic assignment in the E-step. Here we’re integrating those out and directly numerically estimating the per-document mixing proportions.

  3. Austin Waters

    Hi Jonathan,

    You’re the newest addition to my google reader. 🙂

    Above, I believe you’re actually computing the *mean* of posterior approximated by LDA-C, rather than the mode. (Or at least you wrote the formula for the mean, not the mode, in your post.) I wonder if this has any effect on the overall result?

    • Hi,
      Thanks for reading.

      Good catch; I’m indeed computing the mean, not the mode (post updated)! Computing the mode is actually a little bit problematic using the parameters gleaned from LDA-C — if any of the parameters of a Dirichlet distribution are less than one, then the mode is going to be at one of the corners of the simplex which is not really conducive for the comparison above.

      This suggests another subtle difference between the two techniques. At this late hour I don’t know how to do a “fair” comparison, but if you can concoct something, let me know.

  4. Pingback: Jonathan’s Research Blog

  5. winnerdy

    Hi Jonathan,

    Very interesting insight! But I still have one question here. In your post, you’re only trying to estimate $\theta$ given $\beta$. In reality, we need to estimate both. Can you method be used to estimate both parameters? I know we need to add the prior of $\beta$ to the objective function. Then how should we optimize our objective and get the optimal value of $\theta$ and $\beta$. I’m sorry I cannot figure it out as I don’t have a strong background in optimization.

    • In principle you could estimate both. This post is pretty old and subsequently the stochastic variational techniques suggested by Hoffman et al. which have a similar flavor suggest that this sort of join optimization could be very well the best way to train things.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s