When is first-order better than second-order?

Dave and I were recently talking about Asuncion et al.’s wonderful recent paper “On Smoothing and Inference for Topic Models.” One thing that caught our eye was the CVB0 inference method for topic models, which is described as a first-order approximation of the collapsed variational Bayes approach. The odd thing is that this first-order approximation performs better than other, more “principled” approaches. I want to try to understand why. Here’s my current less-than-satisfactory stab:

Let me just lay out the problem. Suppose I want to approximate the marginal posterior over topic assignments z_i in a topic model given the observed words w, p(z_i | w). We can expand this probability using an expectation,

p(z_i | w) = \mathbb{E}_{p(z_{-i} | w)}[p(z_i | z_{-i}, w)].

We can’t compute the expectation analytically, so we must turn to an approximate inference technique. One technique is to run a Gibbs sampler whence we get S samples from the joint posterior over topic assignments z^{(1)}, z^{(2)}, \ldots, z^{(S)}. Then using these samples we approximate the expectation,

p(z_i | w) \approx \frac{1}{S}\sum_s p(z_i | z^{(s)}_{-i}).

In the case of LDA, this conditional probability is proportional to

p(z_i | z^{(s)}_{-i}) \propto \frac{N^{\lnot i}_{wk} + \eta}{N^{\lnot i}_k + V \eta} (N^{\lnot i}_{dk} + \alpha),


  • \alpha is the Dirichlet hyperparameter for topic proportion vectors;
  • \eta is the Dirichlet hyperparameter for the topic multinomials;
  • N^{\lnot i}_{wk} is the number of times topic k has been assigned to word w;
  • N^{\lnot i}_{k} is the number of times topic k has been assigned overall;
  • N^{\lnot i}_{dk} is the number of times topic k has been assigned in document d.

Note that the above counts do not include the current value of z_i (hence the superscript).

Instead of the Gibbs sampling approach, we could also approximate the expectation by taking a first order approximation (which we denote \gamma_i),

\mathbb{E}[p(z_i | z_{-i})] \approx \gamma_i \propto \frac{\mathbb{E}[N^{\lnot i}_{wk}] + \eta}{\mathbb{E}[N^{\lnot i}_k] + V \eta} (\mathbb{E}[N^{\lnot i}_{dk}] + \alpha),

where the expectations are taken with respect to p(z_{-i} | w). Because the terms in the expectations are simple sums, they can be computed solely as functions of \gamma. For example,

\mathbb{E}[N^{\lnot i}_k] = \mathbb{E}[\sum_{j \ne i} z_{jk}] = \sum_{j \ne i} \mathbb{E}[z_{jk}] = \sum_{j \ne i} p(z_j) \approx \sum_{j \ne i} \gamma_j.

Thus, the solution to this approximation is exactly the CVB0 technique described in the paper. Note that I never directly introduced the concept of a variational distribution! CVB0 is simply a first-order approximation to the true expectations; in contrast, the second-order CVB approximation is an approximation of the variational expectations. So maybe that’s the answer to the puzzle: sometimes a first-order approximation to the true value is better than a second-order approximation to a surrogate objective.

Does anyone have any other explanations?



Filed under Uncategorized

3 responses to “When is first-order better than second-order?

  1. Max Welling

    Hi Jonathan,

    How are you?
    This derivation is related to what Yee Whye, Dave and I called the “Callen equations” in our CVB paper.

    One thing perhaps to mention is that this is only an approximation of the implicit equation that still needs to be iterated to convergence. Hence, errors can build up and the final answer can be very wrong. Instead, a variational approximation has an approximation to the objective function which may therefore have more value.

    Another question: what is the second order approximation of what you call “the true expectations”? Isn’t that exactly equal to the CVB equations as well? I guess I am not quite clear what the difference is between true and variational expectations or how they are defined.

    • So the second order approximation has two parts, one is expanding the conditional probabilities to second order, the other is evaluating its expectation. The first part is easy, it amounts to a covariance matrix and some weighting.

      In the CVB paper, one computes the covariance matrix with respect to the variational distribution (because it’s not easy to do so with respect to the true distribution). Since there’s no need to bring in a variational distribution for the definition of CVB0, it’s unclear how you would extend it to second order tractably.

  2. Pingback: Chang et al. (2009) Reading Tea Leaves: How Humans Interpret Topic Models « LingPipe Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s