The central challenge of variational methods is usually computing expectations of log probabilities. In the case of the RTM, this is where .
The first term is linear and so is easy enough; the second is the problematic one. One approach is to use a Taylor approximation. The issue then becomes choosing the point around which to center the approximation. The partition function above really has two regimes: for small , but for large . The solution that the delta method uses is to center it at the mean . But does this give us any real guarantee that we wouldn’t do better by centering it somewhere else?
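To make the centering question concrete, here is a sketch of the expansion. It assumes a logistic link, so that the problematic term is the softplus function; the symbols A, x_0, \mu and \sigma are my own notation and may not match the formulas above.

```latex
% Assumption: logistic link, so the problematic term is A(x) = \log(1 + e^{x}).
A(x) \approx A(x_0) + A'(x_0)\,(x - x_0) + \tfrac{1}{2} A''(x_0)\,(x - x_0)^2,
\qquad A'(x) = \sigma(x), \quad A''(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)

% Centering at the mean, x_0 = \mu = \mathbb{E}[x], makes the first-order term vanish in expectation:
\mathbb{E}[A(x)] \approx A(\mu) + \tfrac{1}{2}\,\sigma(\mu)\bigl(1 - \sigma(\mu)\bigr)\operatorname{Var}(x)
```

Under that assumption the first-order approximation at the mean is just A(\mu), and the leading error term is at most Var(x)/8, since \sigma(1 - \sigma) never exceeds 1/4.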
I couldn’t really answer this question analytically, so I decided to experiment. I sampled using settings typical of the corpora I look at. It turns out that the first-order approximation at the mean is really good, because the variance of z is very low once you have enough words.
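For anyone who wants to try something similar, here is a minimal sketch of that kind of experiment. It is not the code I ran: the topic proportions, coefficients, and intercept are made-up values, the link is assumed to be logistic, and the two documents are assumed to be independent. It compares a Monte Carlo estimate of the expected log partition function with the first-order approximation at the mean.

```python
import numpy as np

# Hypothetical settings, chosen only for illustration.
THETA_1 = np.array([0.40, 0.30, 0.15, 0.10, 0.05])  # topic proportions, doc 1
THETA_2 = np.array([0.10, 0.20, 0.30, 0.25, 0.15])  # topic proportions, doc 2
ETA = np.array([3.0, -2.0, 4.0, -1.0, 2.0])         # regression coefficients
NU = -1.0                                           # intercept


def softplus(x):
    """log(1 + exp(x)), computed stably (assumes a logistic link)."""
    return np.logaddexp(0.0, x)


def sample_zbar(rng, theta, n_words, n_samples):
    """Empirical topic proportions: multinomial counts divided by n_words."""
    return rng.multinomial(n_words, theta, size=n_samples) / n_words


def compare(n_words, n_samples=20000, seed=0):
    """Monte Carlo E[softplus(x)] versus softplus(E[x])."""
    rng = np.random.default_rng(seed)
    z1 = sample_zbar(rng, THETA_1, n_words, n_samples)
    z2 = sample_zbar(rng, THETA_2, n_words, n_samples)
    x = (z1 * z2) @ ETA + NU
    # The documents are sampled independently, so E[z1 * z2] = THETA_1 * THETA_2.
    mean_x = (THETA_1 * THETA_2) @ ETA + NU
    return {
        "mc": float(softplus(x).mean()),
        "first_order": float(softplus(mean_x)),
        "var_x": float(x.var()),
    }


if __name__ == "__main__":
    for n_words in (5, 50, 500, 5000):
        r = compare(n_words)
        gap = abs(r["mc"] - r["first_order"])
        print(f"N={n_words:5d}  var(x)={r['var_x']:.2e}  |gap|={gap:.2e}")
```

The variance of x shrinks roughly like 1/N in the number of words, which is why the gap closes as documents get longer.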
That of course brings up another question. Why does doing the “correct” () thing not work as well as the “incorrect” () approximation?