# Daily Archives: December 1, 2008

## Further approximation improvements

I was a little bit sneaky in my previous post and the reason I say this will become apparent when I change the function being approximated.  Let’s break it down like this: there are three things we want to compare $\log \mathbb{E}[f(x)] (1),$ $\log f(\mathbb{E}[x]) (2),$ and $\mathbb{E}[\log f(x)] (3).$ And just to add a little more notation, let $f_\pi(\bar{z}) = \pi_1 \bar{z} + \pi_0 (1 - \bar{z})$ and $f_\sigma(\bar{z}) = \sigma(\eta \bar{z} + nu).$

In the previous post, I focused on $f_\pi$ and I was comparing (3) against both (1) and (2) by virtue of (1) and (2) being equal for this choice of f.  I lamented that the approximation wasn’t so good.  We can try to ameliorate the situation by adding second order terms.  Let’s focus on going from (3) to (2).  (2) is generally the easiest to compute as you only have to evaluate f at one point.  (3) often occurs in the objective function of variational approximations.  Now the second order term here is $\frac{1}{2}\mathrm{Var}(x)(\frac{f''(\mathbb{E}[x])}{f(\mathbb{E}[x])} - (\frac{f'(\mathbb{E}[x])}{f(\mathbb{E}[x])})^2).$   (Note that this gives you the result I posted a fe posts back regarding the 2nd order approximation of $\latex f_\pi$) Adding this term in leads to the following results (same methodology as before):

2nd order approximation of $f_\pi$

2nd order approximation of $f_\sigma$

Note that here we do much better than before.  Except for a few errant cases we’re actually down almost an order of magnitude.  We should be happy that we’ve got the problem licked.  Ok, on to the next problem.  Suppose we want to compare (1) and (2).  Note that these are identical for $f_\pi$ by linearity.  But for $f_\sigma$ they are not.  Again, (1) is easy to compute but (2) is usually what we want for a predictive probability.    The second order term here is simpler since we can disregard the log (as they appear on the outer most layer of both (1) and (2)): $\frac{1}{2}\mathrm{Var}(x)f''(\mathbb{E}[x]).$   Let’s run the same experiments again, except this time comparing (1) and (2) for $f_\sigma$ only.

Error using a 1st order predictive approximation

Error using a 2nd order predictive approximation

With a first order approximation, we get some pretty significant errors.  Using the second order approximation greatly reduces those errors (by an order of magnitude).  It seems like the 2nd order approximation is a good idea for this application.  Now the question is, will more accuracy here lead to a better performing model on real data?

Filed under Uncategorized

In the previous post, I posed two questions.   I’ll answer the second first.

This question considers what would happen if the response function (any response function) were to depend only on a single latent variable.  To use the notation of the previous post, I’d write $p(y = 1 | \vec{z}, \lambda) = \phi(z_\lambda).$  Here I will be a little more general and allow $z$ to be multinomial rather than binomial.  The model presented in the previous post can then just be written as $p(y = 1 | \vec{z}, \lambda) = \pi^T z_\lambda,$ where with a slight abuse of notation I’m letting $z_\lambda$ denote an indicator vector.

So it turns out that we can just reduce any choice of $\phi$ to this construction by rewriting $p(y = 1 | \vec{z}, \lambda) = \sum_k \phi(k) \delta_k(z_\lambda) = \vec{\phi}^T z_\lambda.$  For example, in the model presented several posts earlier, $\phi(z) = \sigma(\eta^T z + \nu),$ which means that we can represent this in the more general parameterization as $\pi=\sigma(\eta + \nu).$  Thus the answer to the question is a non-starter; the previous post already answered it.

The other question asks what would happen if we took the previous post’s model, but made the response a function of $\bar{z}$ instead of $z_\lambda.$  In other words, $p(y = 1 | \vec{z}) = \pi^T \bar{z}.$   This looks pretty similar (in expectation) to the previous model except that $\bar{z}$ is part of the model now rather than just a result of an expectation.  So, if we want to compute $\mathbb{E}_z[p(y = 1| \vec{z})],$ it works out to be the same.  But if we want to compute $\mathbb{E}_z[\log p(y = 1 | \vec{z})],$ it’s different.  In terms of the original formulation, we are moving the logarithm across the expectation with respect to $z,$ but NOT across the expectation with respect to $\lambda.$

Moving it as such, it does not have a simple closed form solution, so I can’t provide an analytic solution to the error incurred by moving the log inside the expectation.  However, I can empirically estimate this by sampling from $\bar{z}$ where each $z_i$ is drawn according to a binomial distribution parameterized by $\theta.$  I then compute $p(y = 1)$ using this $\bar{z}$ and take the mean over all samples (in this case, 100 samples were used for every point).  I compare this value to the value obtained by computing $p(y=1)$ at the mean over these same samples.   This error is what I plot in the attached figure as a function of $\theta.$

I produce three series for different values of $N,$ the number of draws from the binomial distribution used to create $\bar{z}.$   I chose values of $\pi_1 = \sigma(4), \pi_0 = \sigma(-3).$  This corresponds to the red line in the previous post.  One thing that is evident is that the error is much smaller.  You can think of the previous post’s curve as the $N=1$ case.  As $N$ increases, the covariance on $\bar{z}$ decreases and the error drops.  This is rather comforting.  However, for reasonable values of $N,$ the errors are still rather large: 0.10 may not seem like a lot in log likelihood, but that’s a lot larger that the difference between most techniques!