# Monthly Archives: November 2008

## What’s the deal with the logistic response?

Below I asked why an approximation which is more accurate seems to perform worse than one which is more haphazard.  The two methods differ in two principal ways:

• The approximate gradient computed during the E-step.
• The optimization of the parameters in the M-step.

In the E-step they differ mainly in how they push a node’s topic distribution toward its neighbors’.  With the simple but incorrect method, the gradient is proportional to $\eta^t z'$, while the correct method gives $\eta^t z'(1 - \sigma(\eta^t z \circ z' + \nu))$.  Intuitively, the latter says that the closer we are to the right answer, the less we need to push.
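The damping effect can be seen in a small sketch.  This is only an illustration of the two gradient forms above, not the actual E-step code; the names `eta`, `z`, `z_prime`, and `nu` just mirror the symbols in the formulas, and the example vectors are made up:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def push_incorrect(eta, z_prime):
    # "Incorrect" push: proportional to eta * z', regardless of how well
    # the link is already being predicted.
    return [e * zp for e, zp in zip(eta, z_prime)]

def push_correct(eta, z, z_prime, nu):
    # "Correct" push: same direction, but damped by
    # (1 - sigma(eta^t (z o z') + nu)), so the closer the predicted link
    # probability is to 1, the smaller the push.
    x = sum(e * zi * zp for e, zi, zp in zip(eta, z, z_prime)) + nu
    damp = 1.0 - sigmoid(x)
    return [e * zp * damp for e, zp in zip(eta, z_prime)]
```

When the link is already well explained (large $x$), the damping factor shrinks the correct push toward zero while the incorrect one keeps pushing at full strength.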

I futzed with the code so that we push more like the incorrect method, but this didn’t seem to affect the results.  I suspect that $\eta^t z \circ z' + \nu$ is always pretty small, so the damping factor doesn’t have much of an impact.  Then I tried changing the M-step.  In particular, I tried removing the M-step fitting with $\psi_\sigma$.  It turns out that this “fit” performs about as well as $\psi_e$.  Examining the fits shows that $\eta$ under $\psi_\sigma$ is consistently smaller than under $\psi_e$.  Why is this?  I hypothesize that the L2 regularization penalty is causing the fit to fail.  Having to determine the optimal penalty is a pain; I originally added the term because the fits were diverging.  Perhaps the right thing to do is figure out why they were diverging in the first place.
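The shrinkage hypothesis is easy to see in a toy version of the problem.  This is not the RTM M-step itself, just a one-parameter logistic regression fit by gradient descent, with a made-up dataset, showing how an L2 penalty pulls the fitted $\eta$ toward zero:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_eta(xs, ys, lam, steps=2000, lr=0.1):
    """Fit a 1-D logistic regression p(y=1|x) = sigma(eta * x) by gradient
    descent, minimizing the average log loss plus an L2 penalty lam*eta^2/2."""
    eta = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(eta * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad += lam * eta  # gradient of the L2 penalty
        eta -= lr * grad
    return eta

# Toy data: larger x tends to mean y = 1.
xs = [0.2, 0.5, 0.9, 1.3, 1.7]
ys = [0, 0, 1, 1, 1]
```

Fitting with a nonzero penalty gives a noticeably smaller $\eta$ than the unpenalized fit, which is consistent with the smaller $\eta$ observed under $\psi_\sigma$ if the penalty is doing the shrinking.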

Filed under Uncategorized

## Reproducing experiments and rediscovering alpha

It is a good idea to always make experiments exactly reproducible.  In the heat of a paper deadline, a bunch of experiments get run with different parameter settings and whatnot, and a few months down the road you might be hard pressed to remember exactly how you got those numbers.  I’ve learned this the hard way, so for the last few submissions I’ve tried to make sure something in the CVS encodes the precise settings necessary.

Last night, while looking up what I did, I rediscovered that alpha is important sometimes.  Not so much because of its scale, but because of the distribution over topics it reports.  My experience is that fitting alpha in an EM context rarely works; it either totally messes up your inference or it just reproduces the alpha you fed in initially.  But I did get into the habit of having the model fit and spit out alpha after the very last iteration of EM, just to get some idea of what the topic distributions look like.

Usually you get something which is pretty uniform over your K topics.  But sometimes you don’t.  In old-fashioned LDA, this usually happens when you have too many topics.  With models such as RTM, you can also get this effect because your graph manifold may have cliques of different sizes.  In these cases, when you’re computing predictive word likelihood of an empty document, you are liable to perform much worse than baseline if you assume a symmetric Dirichlet prior rather than one which reflects the topic distribution on your training set.
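To make the empty-document point concrete, here is a minimal sketch (not the actual evaluation code).  With no observed words, the topic proportions are governed entirely by the Dirichlet prior, so the predictive word distribution is the alpha-weighted mixture of the topics; a symmetric alpha weights all topics equally even when the training set concentrated on a few:

```python
import math

def empty_doc_word_loglik(alpha, beta):
    """Predictive log-likelihood of each word in an empty document.

    alpha: list of K Dirichlet parameters.
    beta:  K x V list of topic-word probabilities.
    With no words observed, E[theta] = alpha / sum(alpha), so the
    predictive word distribution is the mixture sum_k E[theta_k] beta_k.
    """
    total = sum(alpha)
    theta_mean = [a / total for a in alpha]
    K, V = len(beta), len(beta[0])
    probs = [sum(theta_mean[k] * beta[k][w] for k in range(K))
             for w in range(V)]
    return [math.log(p) for p in probs]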

Filed under Uncategorized

## Approximating the logistic response

The central challenge of variational methods is usually computing expectations of log probabilities.  In the case of the RTM, this is $\mathbb{E}[\log p(y | z, z')] = y \mathbb{E}[x] - \mathbb{E}[\log(1 + \exp(x))],$ where $x = \eta^t z \circ z' + \nu$.

The first term is linear and so easy enough; the second is problematic.  One approach is to use a Taylor approximation, and the issue then becomes choosing the point around which to center it.  The partition function above really has two regimes: for small $x$, $\log(1 + \exp(x)) \approx 0$, but for large $x$, $\log(1 + \exp(x)) \approx x$.  The delta method centers the expansion at the mean $\mu = \mathbb{E}[x]$.  But does this give us any real guarantee that we won’t be better off centering it elsewhere?

I couldn’t really answer this question analytically, so I decided to experiment.  I sampled $x$ using settings typical of the corpora I look at.  It turns out that the first-order approximation at the mean is really good, because the variance on $z$ is really low when you have enough words.
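A version of this experiment is easy to reproduce.  This sketch assumes $x$ is roughly Gaussian (the particular means and variances below are made up, not the corpus settings) and compares a Monte Carlo estimate of $\mathbb{E}[\log(1 + \exp(x))]$ to the first-order expansion at the mean, $\log(1 + \exp(\mu))$:

```python
import math
import random

def softplus(x):
    # log(1 + exp(x)), using log1p for numerical accuracy
    return math.log1p(math.exp(x))

def delta_vs_mc(mu, sigma, n=100000, seed=0):
    """Return (Monte Carlo estimate of E[log(1+exp(x))] for x ~ N(mu, sigma^2),
    first-order approximation log(1+exp(mu)))."""
    rng = random.Random(seed)
    mc = sum(softplus(rng.gauss(mu, sigma)) for _ in range(n)) / n
    return mc, softplus(mu)
```

With a small variance the two agree closely, which matches the observation above; as the variance grows, the approximation at the mean degrades, since it ignores the curvature between the two regimes.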

That of course brings up another question: why does doing the “correct” ($\psi_\sigma$) thing not work as well as the “incorrect” ($\psi_e$) approximation?