November 26, 2008

Predictions Using Mixture Components

Suppose that I have the following model: z_i \sim \mathrm{B}(\theta) \;\mathrm{ for }\; i=1\ldots N, \lambda \sim \mathrm{Unif}(1\ldots N), and p(y = 1 | \vec{z}, \lambda) = \pi_1 z_\lambda + \pi_0 (1 - z_\lambda), where \mathrm{B}(\theta) denotes a Bernoulli distribution. z, \lambda are hidden and y is observed.  This model is perhaps not as facile as it seems; the z's might be mixture components in a mixture model, i.e., you can imagine hanging other observations off of these variables.

Since z, \lambda are hidden, we usually resort to some sort of approximate inference mechanism.  For the sake of argument, assume that we’ve managed to a.) estimate the parameters of the model (\theta, \pi_1, \pi_0) on some training set and b.) (approximately) compute the posteriors p(\lambda) and p(z_i) for some test document using the other parts of the model which I’m purposefully glossing over.   The question is, how do we make a prediction about y on this test document using this information?  

Seems like an easy enough question, and for exposition's sake I'm going to make this even easier.  Assume that the (approximate) posterior over \lambda is uniform, p(\lambda) = \frac{1}{N}.   Now, to make the prediction we marginalize out the hidden variables: p(y) = \sum_{\vec{z}} \sum_\lambda p(y | \vec{z}, \lambda) p(\vec{z}) p(\lambda) = \mathbb{E}_{p(\vec{z})}[\sum_\lambda p (y | \vec{z}, \lambda) \frac{1}{N}].  Since \sum_\lambda p(y = 1 | \vec{z}, \lambda) \frac{1}{N} = \pi_1 \bar{z} + \pi_0 (1 - \bar{z}), where \bar{z} = \frac{1}{N} \sum_i z_i, linearity of expectation lets us rewrite this as \pi_1 \mathbb{E}[\bar{z}] + \pi_0 (1 - \mathbb{E}[\bar{z}]).  And if, as usual, I wanted the log probability, that would be easy too: \log (\pi_1\mathbb{E}[\bar{z}] + \pi_0 ( 1 - \mathbb{E}[\bar{z}])).
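As a sanity check, here's a small numerical sketch (the parameter values and posteriors below are mine, chosen arbitrarily, not taken from anywhere in this post): it compares the brute-force marginalization over \vec{z} and \lambda against the closed form \pi_1 \mathbb{E}[\bar{z}] + \pi_0 (1 - \mathbb{E}[\bar{z}]).

```python
import itertools
import random

# Hypothetical settings for illustration -- none of these numbers are the post's.
N = 4
pi1, pi0 = 0.9, 0.2          # p(y=1 | z_lambda=1) and p(y=1 | z_lambda=0)
random.seed(0)
q = [random.random() for _ in range(N)]   # approximate posteriors q_i = p(z_i = 1)

# Brute force: p(y=1) = sum_z sum_lambda p(y=1|z,lambda) p(z) p(lambda),
# with p(lambda) = 1/N and the z_i independent under the approximate posterior.
p_brute = 0.0
for z in itertools.product([0, 1], repeat=N):
    p_z = 1.0
    for zi, qi in zip(z, q):
        p_z *= qi if zi else (1 - qi)
    for lam in range(N):
        p_brute += (pi1 * z[lam] + pi0 * (1 - z[lam])) * p_z / N

# Closed form: pi1 E[zbar] + pi0 (1 - E[zbar]).
E_zbar = sum(q) / N
p_closed = pi1 * E_zbar + pi0 * (1 - E_zbar)

print(p_brute, p_closed)   # the two should agree up to floating point
```

The exponential-size sum over \vec{z} collapses to a one-dimensional function of \mathbb{E}[\bar{z}] precisely because p(y | \vec{z}, \lambda) is linear in the indicators.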

But suppose I were really nutter-butters and I decided to compute the log predictive likelihood another way, say by computing \mathbb{E}_\lambda[\mathbb{E}_{\vec{z}}[\log p(y | \vec{z}, \lambda)]] instead of \log \mathbb{E}_{\vec{z}}[\mathbb{E}_\lambda [p(y | \vec{z}, \lambda)]].  It is worth pointing out that Jensen's inequality tells us that the first quantity is a lower bound on the second, since the log is concave.  Because \vec{z} is a vector of indicators, the log probability has a convenient form, namely \log p(y = 1 | \vec{z}, \lambda) = \log (\pi_1) z_\lambda + \log(\pi_0) (1 - z_\lambda).  Applying the expectations yields \log (\pi_1) \mathbb{E}[\bar{z}] + \log(\pi_0) ( 1-\mathbb{E}[\bar{z}]).
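To see the gap concretely, here is a tiny sketch (again with my own illustrative numbers) comparing the two quantities as \mathbb{E}[\bar{z}] varies; the mean of the log always sits at or below the log of the mean:

```python
import math

pi1, pi0 = 0.9, 0.2   # hypothetical values, chosen only for illustration
for m in [0.1, 0.3, 0.5, 0.7, 0.9]:   # m plays the role of E[zbar]
    log_of_mean = math.log(pi1 * m + pi0 * (1 - m))
    mean_of_log = math.log(pi1) * m + math.log(pi0) * (1 - m)
    # Jensen: E[log X] <= log E[X] because the log is concave.
    assert mean_of_log <= log_of_mean
    print(f"m={m:.1f}  log-of-mean={log_of_mean:.4f}  mean-of-log={mean_of_log:.4f}")
```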

Jensen’s inequality is, loosely, a statement about a first-order Taylor approximation, i.e., \mathbb{E}[f(X)] \approx f(\mathbb{E}[X]), when f is convex (or, as with the log, concave).    When is this approximation good?  We can get a sense by looking at the second-order term.  I will save you the arduous derivation: \log \mathbb{E}[p(y = 1 | \vec{z}, \lambda)] - \mathbb{E}[\log p(y = 1 | \vec{z}, \lambda)] \approx \frac{(\pi_1 - \pi_0)^2}{2 (\pi_1 \mathbb{E}[\bar{z}] + \pi_0 (1 - \mathbb{E}[\bar{z}]))^2} \mathbb{E}[\bar{z}] (1 - \mathbb{E}[\bar{z}]), i.e., \mathrm{Var}[p(y = 1 | \vec{z}, \lambda)] / (2\, \mathbb{E}[p(y = 1 | \vec{z}, \lambda)]^2).
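Here's a quick numerical sketch comparing the exact difference with the standard second-order (delta-method) term \mathrm{Var}(X)/(2\,\mathbb{E}[X]^2); the parameter choices are mine, picked loosely in the spirit of the plot (\pi_1 = \sigma(4), and one hypothetical \pi_0 = \sigma(0)):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

pi1, pi0 = sigmoid(4), sigmoid(0)   # hypothetical choice of pi0 for illustration

results = []
for m in [0.05, 0.25, 0.5, 0.75, 0.95]:   # m = E[zbar]
    mean = pi1 * m + pi0 * (1 - m)         # E[p(y=1 | z, lambda)]
    true_diff = math.log(mean) - (math.log(pi1) * m + math.log(pi0) * (1 - m))
    # Delta-method term: Var(X) / (2 E[X]^2), where X takes value pi1 with
    # probability m and pi0 otherwise, so Var(X) = (pi1 - pi0)^2 m (1 - m).
    approx = (pi1 - pi0) ** 2 * m * (1 - m) / (2 * mean ** 2)
    results.append((m, true_diff, approx))
    print(f"m={m:.2f}  true={true_diff:.4f}  2nd-order={approx:.4f}")
```

Both columns are nonnegative (the true column by Jensen, the approximation by construction); how closely they track each other depends on where \mathbb{E}[\bar{z}] sits.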

Now, what better accompaniment to an equation than a plot?  I’ve plotted the 2nd-order approximation to the difference (solid lines) and the true difference (dashed lines) for different values of \pi_0, keeping \pi_1 fixed at \sigma(4), where \sigma is the sigmoid function.   The errors are plotted as functions of \mathbb{E}[\bar{z}].  I’ve also put a purple line at \log(0.5); that’s the actual log probability you’d see if your predictive model just guessed y=1 with probability 0.5.

[Figure: Moving the log into the expectation]

First thing off the bat: the error is pretty large.  For fairly reasonable settings of the parameters, the difference between the two predictive methods actually exceeds the error one would get just by guessing 50/50.  For small values of \mathbb{E}[\bar{z}], the 2nd order approximation is pretty good, but for large values it greatly underestimates the error.  Should I be worried (because, for example, moving the log around the expectation is a key step in EM, variational methods, etc.)?

I was motivated to do all this because of a previous post, in which I claimed that a certain approximation of a similar nature was good; the reason there was the low variance of the random variable.  Here, the variance is large when the expectation is not close to either 0 or 1.  Thus the second (and higher) order terms cannot be ignored.  There are really two questions I want answered now:

  1. What happens if the random variable in this model is \bar{z} rather than z?  
  2. What happens if the random variable in the earlier post is z instead of \bar{z}?
