Suppose that I have the following model: and are hidden and is observed. This model is perhaps not as facile as it seems; the ‘s might be mixture components in a mixture model, i.e., you can imagine hanging other observations off of these variables.
Since are hidden, we usually resort to some sort of approximate inference mechanism. For the sake of argument, assume that we’ve managed to a.) estimate the parameters of the model () on some training set and b.) (approximately) compute the posteriors and for some test document using the other parts of the model which I’m purposefully glossing over. The question is, how do we make a prediction about on this test document using this information?
Seems like an easy enough question; and for exposition’s sake I’m going to make this even easier. Assume that . Now, to do the prediction we marginalize out the hidden variables By linearity, we can rewrite this as where And if, as usual, I wanted the log probabilities, those would be easy too,
But suppose I were really nutter-butters and I decided to compute the log predictive likelihood another way, say by computing the expectation of the log probability instead of the log of the expected probability. It is worth pointing out that Jensen’s inequality tells us that the first term is a lower bound on the latter term. Because the is a set of indicators, the log probability has a convenient form, namely Applying the expectations yields
Jensen’s inequality is basically a statement about a first order Taylor approximation, i.e., $\mathbb{E}[f(x)] \ge f(\mathbb{E}[x])$ when $f$ is convex. When is this approximation good? We might attempt to answer this by looking at the second order term. I will save you the arduous derivation:
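For reference, here is a generic version of the second order expansion of a log around the mean. The notation is illustrative only: $X$ is any positive random variable with mean $\mu$ (think of it as the predictive probability as a function of the hidden variables), and this is the standard delta-method form, not necessarily the exact expression for this model.

```latex
\begin{align*}
\log X &\approx \log\mu + \frac{X-\mu}{\mu} - \frac{(X-\mu)^2}{2\mu^2},
  \qquad \mu = \mathbb{E}[X], \\
\mathbb{E}[\log X] &\approx \log\mu - \frac{\operatorname{Var}(X)}{2\mu^2}, \\
\log\mathbb{E}[X] - \mathbb{E}[\log X] &\approx \frac{\operatorname{Var}(X)}{2\mu^2}.
\end{align*}
```

The first order term drops out because $\mathbb{E}[X - \mu] = 0$, so to second order the gap is governed by the variance relative to the squared mean.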
Now, what better accompaniment to an equation than a plot? I’ve plotted the 2nd order approximation to the difference (solid lines) and the true difference (dashed lines) for different values of while keeping fixed at where is the sigmoid function. The errors are plotted as functions of I’ve also put a purple line at that’s the actual log probability you’d see if your predictive model just guessed with probability 0.5.
First thing off the bat: the error is pretty large. For fairly reasonable settings of the parameters, the difference between the two predictive methods actually exceeds the error one would get just by guessing 50/50. For small values of the 2nd order approximation is pretty good, but for large values it greatly underestimates the error. Should I be worried (because, for example, moving the log around the expectation is a key step in EM, variational methods, etc.)?
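A minimal numerical sketch of this behavior, under assumptions that are mine and not taken from the model above: a single hidden indicator with two components, a posterior `q` over those components, and a predictive probability `sigmoid(eta_k)` for component `k`, with `eta = (scale, -scale)`. The function name `jensen_gap` is hypothetical, and the second order estimate is the generic variance-over-squared-mean term, which may not match the exact expression plotted.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def jensen_gap(q, etas):
    """Return (exact, second_order) for log E[p] - E[log p].

    q    : posterior over components (assumed to sum to 1)
    etas : per-component parameters; p_k = sigmoid(eta_k) is an
           assumed toy form, not the model from the post
    """
    probs = [sigmoid(e) for e in etas]
    mean = sum(qk * pk for qk, pk in zip(q, probs))
    expected_log = sum(qk * math.log(pk) for qk, pk in zip(q, probs))
    var = sum(qk * (pk - mean) ** 2 for qk, pk in zip(q, probs))
    exact = math.log(mean) - expected_log      # true Jensen gap
    second_order = var / (2.0 * mean ** 2)     # generic 2nd order estimate
    return exact, second_order


q = [0.5, 0.5]                 # posterior far from 0 or 1
coin_flip = math.log(2.0)      # |log 0.5|, the 50/50 guessing baseline
for scale in [0.5, 1.0, 2.0, 4.0]:
    exact, approx = jensen_gap(q, [scale, -scale])
    print(
        f"scale={scale:.1f} exact={exact:.4f} "
        f"second_order={approx:.4f} exceeds_coin_flip={exact > coin_flip}"
    )
```

In this toy setup the two numbers stay close for small `scale` and drift apart as `scale` grows, with the second order estimate falling below the exact gap.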
I was motivated to do all this because of a previous post, in which I claimed that a certain approximation of a similar nature was good; the reason there was the low covariance of the random variable. Here, the covariance is large when the expectation is not close to either 0 or 1. Thus the second (and higher) order terms cannot be ignored. There are really two questions I want answered now:
- What happens if the random variable in this model is rather than ?
- What happens if the random variable in the earlier post is instead of ?