26 | November | 2008 | Please Scoop Me!

Suppose that I have the following model: $z_i \sim \mathrm{B}(\theta) \;\mathrm{ for }\; i=1\ldots N,$ $\lambda \sim \mathrm{Unif}(1\ldots N),$ and $p(y = 1 | \vec{z}, \lambda) = \pi_1 z_\lambda + \pi_0 (1 - z_\lambda).$ $z, \lambda$ are hidden and $y$ is observed. This model is perhaps not as facile as it seems; the $z$ ‘s might be mixture components in a mixture model, i.e., you can imagine hanging other observations off of these variables.

Since $z, \lambda$ are hidden, we usually resort to some sort of approximate inference mechanism. For the sake of argument, assume that we’ve managed to a.) estimate the parameters of the model ( $\theta, \pi_1, \pi_0$ ) on some training set and b.) (approximately) compute the posteriors $p(\lambda)$ and $p(z_i)$ for some test document using the other parts of the model which I’m purposefully glossing over. The question is, how do we make a prediction about $y$ on this test document using this information?

Seems like an easy enough question; and for exposition’s sake I’m going to make this even easier. Assume that $p(\lambda) = \frac{1}{N}$ . Now, to do the prediction we marginalize out the hidden variables $p(y) = \sum_{\vec{z}} \sum_\lambda p(y | \vec{z}, \lambda) p(\vec{z}) p(\lambda) = \mathbb{E}_{p(\vec{z})}[\sum_\lambda p (y | \vec{z}, \lambda) \frac{1}{N}].$ By linearity, we can rewrite this as $\pi_1 \mathbb{E}[\bar{z}] + \pi_0 (1 - \mathbb{E}[\bar{z}]),$ where $\bar{z} = \frac{1}{N} \sum_i z_i.$ And if, as usual, I wanted the log probabilities, those would be easy too, $\log (\pi_1\mathbb{E}[\bar{z}] + \pi_0 ( 1 - \mathbb{E}[\bar{z}])).$

But suppose I were really nutter-butters and I decided to compute the log predictive likelihood another way, say by computing $\mathbb{E}_\lambda[\mathbb{E}_{\vec{z}}[\log p(y | \vec{z}, \lambda)]]$ instead of $\log \mathbb{E}_{\vec{z}}[\mathbb{E}_\lambda [p(y | \vec{z}, \lambda)]].$ It is worth pointing out that Jensen’s inequality tells us that the first term is a bound on the latter term. Because the $\vec{z}$ is a set of indicators, the log probability has a convenient form, namely $\log p(y = 1 | \vec{z}, \lambda) = \log (\pi_1) z_\lambda + \log(\pi_0) (1 - z_\lambda).$ Applying the expectations yields $\log (\pi_1) \mathbb{E}[\bar{z}] + \log(\pi_0) ( 1-\mathbb{E}[\bar{z}]).$

Jensen’s inequality is basically a statement about a first order Taylor approximation, i.e., $\mathbb{E}[f(X)] \approx f(\mathbb{E}[X]),$ when $f$ is convex. When is this approximation good? We might attempt this by looking at the second order term. I will save you the arduous derivation: $\log \mathbb{E}[p(y = 1 | \vec{z}, \lambda)] - \mathbb{E}[\log p(y = 1 | \vec{z}, \lambda)] \approx \frac{\pi_1 - \pi_0}{2 \pi_1 \mathbb{E}[\bar{z}] + 2 \pi_0 (1 - \mathbb{E}[\bar{z}])} \mathbb{E}[\bar{z}] (1 - \mathbb{E}[\bar{z}]).$

Now what better accompaniment to an equation than a plot. I’ve plotted the 2nd order approximation to the difference (solid lines) and the true difference (dashed lines) for different values of $\pi_0$ while keeping $\pi_1$ fixed at $\sigma(4)$ where $\sigma$ is the sigmoid function. The errors are plotted as functions of $\mathbb{E}[\bar{z}].$ I’ve also put a purple line at $\log(0.5);$ that’s the actual log probability you’d see if your predictive model just guessed $y=1$ with probability 0.5.

Moving the log into the expectation

First thing off the bat: the error is pretty large. For fairly reasonable settings of the parameters, the difference between the two predictive methods actually exceeds the error one would get just by guessing 50/50. For small values of $\mathbb{E}[\bar{z}],$ the 2nd order approximation is pretty good, but for large values it greatly underestimates the error. Should I be worried (because, for example, moving the log around the expectation is a key step in EM, variational methods, etc.)?

I was motivated to do all this because of a previous post, in which I claimed that a certain approximation of a similar nature was good; the reason there was the low covariance or the random variable. Here, the covariance is large when the expectation is not close to either 0 or 1. Thus the second (and higher) order terms cannot be ignored. There are really two questions I want answered now:

What happens if the random variable in this model is $\bar{z}$ rather than $z$ ?
What happens if the random variable in the earlier post is $z$ instead of $\bar{z}$ ?

M	T	W	T	F	S	S
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30

Daily Archives: November 26, 2008

Predictions Using Mixture Components

Blog Stats

Archives

Meta