## Predictions Using Mixture Components

Suppose that I have the following model: $z_i \sim \mathrm{B}(\theta) \;\mathrm{ for }\; i=1\ldots N,$ $\lambda \sim \mathrm{Unif}(1\ldots N),$ and $p(y = 1 | \vec{z}, \lambda) = \pi_1 z_\lambda + \pi_0 (1 - z_\lambda).$ Here $z$ and $\lambda$ are hidden and $y$ is observed.  This model is perhaps not as facile as it seems; the $z$’s might be mixture components in a mixture model, i.e., you can imagine hanging other observations off of these variables.
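To make the setup concrete, here is a minimal sketch of the generative story in Python; the particular values of $N$, $\theta$, $\pi_1$, and $\pi_0$ are placeholders I made up, not numbers from anywhere in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameter values, purely for illustration.
N = 10
theta, pi_1, pi_0 = 0.3, 0.9, 0.1

z = rng.binomial(1, theta, size=N)   # z_i ~ Bernoulli(theta), i = 1..N
lam = rng.integers(N)                # lambda ~ Uniform over the N components (0-indexed here)
p_y1 = pi_1 * z[lam] + pi_0 * (1 - z[lam])
y = rng.binomial(1, p_y1)            # y | z, lambda
```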

Since $z, \lambda$ are hidden, we usually resort to some sort of approximate inference mechanism.  For the sake of argument, assume that we’ve managed to a.) estimate the parameters of the model ($\theta, \pi_1, \pi_0$) on some training set and b.) (approximately) compute the posteriors $p(\lambda)$ and $p(z_i)$ for some test document using the other parts of the model, which I’m purposefully glossing over.  The question is: how do we make a prediction about $y$ on this test document using this information?

Seems like an easy enough question, and for exposition’s sake I’m going to make it even easier.  Assume that $p(\lambda) = \frac{1}{N}$.  Now, to make the prediction we marginalize out the hidden variables: $p(y) = \sum_{\vec{z}} \sum_\lambda p(y | \vec{z}, \lambda) p(\vec{z}) p(\lambda) = \mathbb{E}_{p(\vec{z})}[\sum_\lambda p (y | \vec{z}, \lambda) \frac{1}{N}].$  By linearity, we can rewrite this as $\pi_1 \mathbb{E}[\bar{z}] + \pi_0 (1 - \mathbb{E}[\bar{z}]),$ where $\bar{z} = \frac{1}{N} \sum_i z_i.$  And if, as usual, I wanted the log probability, that would be easy too: $\log (\pi_1\mathbb{E}[\bar{z}] + \pi_0 ( 1 - \mathbb{E}[\bar{z}])).$
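As a quick sanity check, here is a sketch (with hypothetical posterior marginals $q_i = p(z_i = 1)$ that I made up) showing that summing over $\lambda$ explicitly and using the linearity shortcut give the same number.

```python
import numpy as np

def predictive_exact(q, pi_1, pi_0):
    """p(y=1) by summing over lambda explicitly; q[i] = posterior p(z_i = 1)."""
    N = len(q)
    # For each lambda, E_z[p(y=1 | z, lambda)] = pi_1 q[lam] + pi_0 (1 - q[lam]).
    return sum(pi_1 * q[lam] + pi_0 * (1 - q[lam]) for lam in range(N)) / N

def predictive_linear(q, pi_1, pi_0):
    """The same quantity via linearity: pi_1 E[zbar] + pi_0 (1 - E[zbar])."""
    zbar = np.mean(q)
    return pi_1 * zbar + pi_0 * (1 - zbar)

q = np.array([0.9, 0.2, 0.4, 0.7])     # hypothetical posteriors p(z_i = 1)
print(predictive_exact(q, 0.9, 0.1))   # 0.54
print(predictive_linear(q, 0.9, 0.1))  # 0.54, same answer
```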

But suppose I were really nutter-butters and I decided to compute the log predictive likelihood another way, say by computing $\mathbb{E}_\lambda[\mathbb{E}_{\vec{z}}[\log p(y | \vec{z}, \lambda)]]$ instead of $\log \mathbb{E}_{\vec{z}}[\mathbb{E}_\lambda [p(y | \vec{z}, \lambda)]].$  It is worth pointing out that Jensen’s inequality tells us that the first quantity is a lower bound on the second.  Because $\vec{z}$ is a vector of indicators, the log probability has a convenient form, namely $\log p(y = 1 | \vec{z}, \lambda) = \log (\pi_1) z_\lambda + \log(\pi_0) (1 - z_\lambda).$  Applying the expectations yields $\log (\pi_1) \mathbb{E}[\bar{z}] + \log(\pi_0) ( 1-\mathbb{E}[\bar{z}]).$
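Here is a small sketch of the two quantities, again with made-up numbers; by Jensen, the expectation-of-the-log version never exceeds the log-of-the-expectation version.

```python
import numpy as np

def log_of_expectation(q, pi_1, pi_0):
    """log E[p(y=1)]: the log stays outside the expectation."""
    zbar = np.mean(q)
    return np.log(pi_1 * zbar + pi_0 * (1 - zbar))

def expectation_of_log(q, pi_1, pi_0):
    """E[log p(y=1)]: the Jensen lower bound."""
    zbar = np.mean(q)
    return np.log(pi_1) * zbar + np.log(pi_0) * (1 - zbar)

q = np.array([0.9, 0.2, 0.4, 0.7])       # hypothetical posteriors p(z_i = 1)
print(log_of_expectation(q, 0.9, 0.1))   # ~ -0.616
print(expectation_of_log(q, 0.9, 0.1))   # ~ -1.094, smaller, as Jensen promises
```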

Jensen’s inequality is basically a statement about a first-order Taylor approximation, i.e., $\mathbb{E}[f(X)] \approx f(\mathbb{E}[X]),$ when $f$ is convex (or, as with the log here, concave).  When is this approximation good?  We can get a handle on this by looking at the second-order term.  I will save you the arduous derivation: $\log \mathbb{E}[p(y = 1 | \vec{z}, \lambda)] - \mathbb{E}[\log p(y = 1 | \vec{z}, \lambda)] \approx \frac{\pi_1 - \pi_0}{2 \pi_1 \mathbb{E}[\bar{z}] + 2 \pi_0 (1 - \mathbb{E}[\bar{z}])} \mathbb{E}[\bar{z}] (1 - \mathbb{E}[\bar{z}]).$
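For concreteness, here are the true gap and the second-order approximation above as functions of $m = \mathbb{E}[\bar{z}]$; this is just a direct transcription of the formulas, nothing more.

```python
import numpy as np

def true_gap(m, pi_1, pi_0):
    """log E[p(y=1)] - E[log p(y=1)] as a function of m = E[zbar]."""
    mu = pi_1 * m + pi_0 * (1 - m)
    return np.log(mu) - (np.log(pi_1) * m + np.log(pi_0) * (1 - m))

def second_order_gap(m, pi_1, pi_0):
    """The second-order approximation to the same gap, as given above."""
    mu = pi_1 * m + pi_0 * (1 - m)
    return (pi_1 - pi_0) / (2 * mu) * m * (1 - m)
```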

Now, what better accompaniment to an equation than a plot?  I’ve plotted the second-order approximation to the difference (solid lines) and the true difference (dashed lines) for different values of $\pi_0$, keeping $\pi_1$ fixed at $\sigma(4)$, where $\sigma$ is the sigmoid function.  The errors are plotted as functions of $\mathbb{E}[\bar{z}].$  I’ve also put a purple line at $\log(0.5);$ that’s the log probability you’d see if your predictive model just guessed $y=1$ with probability 0.5.
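A rough sketch of how such a plot could be reproduced; the specific $\pi_0$ values are guesses, and I’ve drawn the purple reference line at the magnitude $|\log 0.5|$ so it sits on the same scale as the (positive) gap.

```python
import numpy as np
import matplotlib.pyplot as plt

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
pi_1 = sigmoid(4.0)
m = np.linspace(0.001, 0.999, 500)            # E[zbar]

# The particular pi_0 values below are guesses; the post only says pi_0 is varied.
for a in [-4.0, -2.0, 0.0]:
    pi_0 = sigmoid(a)
    mu = pi_1 * m + pi_0 * (1 - m)
    true_gap = np.log(mu) - (np.log(pi_1) * m + np.log(pi_0) * (1 - m))
    approx_gap = (pi_1 - pi_0) / (2 * mu) * m * (1 - m)
    plt.plot(m, approx_gap, "-", label=f"approx, pi_0 = sigmoid({a:g})")
    plt.plot(m, true_gap, "--", label=f"true, pi_0 = sigmoid({a:g})")

# Reference line: |log 0.5|, the log-loss of guessing y = 1 with probability 0.5.
plt.axhline(-np.log(0.5), color="purple")
plt.xlabel(r"$\mathbb{E}[\bar{z}]$")
plt.ylabel("gap")
plt.legend()
plt.show()
```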

First thing off the bat: the error is pretty large.  For fairly reasonable settings of the parameters, the difference between the two predictive methods actually exceeds the error one would get just by guessing 50/50.  For small values of $\mathbb{E}[\bar{z}],$ the second-order approximation is pretty good, but for large values it greatly underestimates the error.  Should I be worried (because, for example, moving the log around the expectation is a key step in EM, variational methods, etc.)?  Two follow-up questions to ponder:
1. What happens if the random variable in this model is $\bar{z}$ rather than $z$?
2. What happens if the random variable in the earlier post is $z$ instead of $\bar{z}$?