What’s the deal with the logistic response?

Below I asked why an approximation that is more accurate seems to perform worse than one that is more haphazard. The two methods differ in two principal ways:

The approximate gradient computed during the E-step.
The optimization of the parameters in the M-step.

In the E-step they differ mainly in how they push a node’s topic distribution toward its neighbors’. With the simple but incorrect method, the gradient is proportional to $\eta^t z'$, while the correct method gives $\eta^t z'(1 - \sigma(\eta^t z \circ z' + \nu))$. Intuitively, the latter tells us that the closer we are to the right answer, the less we need to push.
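To make the difference concrete, here is a minimal sketch of the two gradients in NumPy. It assumes that $\eta^t z'$ in the gradient is meant as the elementwise product of $\eta$ and $z'$ (the gradient of $\eta^t z \circ z' + \nu$ with respect to $z$); the function names and the toy values are made up for illustration and are not taken from the actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_simple(eta, z_prime):
    # Simple method: push proportional to eta * z' (elementwise, by assumption).
    return eta * z_prime

def grad_logistic(eta, z, z_prime, nu):
    # Logistic method: the same direction, damped by 1 - sigma(eta^t (z o z') + nu).
    s = sigmoid(eta @ (z * z_prime) + nu)
    return eta * z_prime * (1.0 - s)

# Hypothetical toy values, chosen only to show the damping factor.
eta = np.array([2.0, 1.0, 0.5])
z = np.array([0.6, 0.3, 0.1])
z_prime = np.array([0.5, 0.4, 0.1])

ratios = {}
for nu in (-5.0, 0.0, 5.0):
    ratios[nu] = grad_logistic(eta, z, z_prime, nu) / grad_simple(eta, z_prime)
    print(nu, ratios[nu])
```

The ratio is the same scalar, $1 - \sigma(\cdot)$, in every coordinate, so the two methods always push in the same direction and differ only in step size. When the argument of $\sigma$ is strongly negative the factor is close to 1, and the two gradients nearly coincide.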

I futzed with the code to make it push more like the incorrect method, but this didn’t seem to affect the results. I suspect that $\eta^t z \circ z' + \nu$ is always pretty small, so this doesn’t have much of an impact.

Then I tried changing the M-Step. In particular, I tried removing the M-Step fitting with $\psi_\sigma$. It turns out that this “fit” performs about as well as $\psi_e$. Examining the fits shows that $\eta$ under $\psi_\sigma$ is consistently smaller than under $\psi_e$. Why is this? I hypothesize that the L2 regularization penalty is causing the $\psi_\sigma$ fit to fail. Having to determine the optimal penalty is a pain; I originally added this term because the fits were diverging. Perhaps the right thing to do is figure out why they were diverging in the first place.
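As a generic illustration of how an L2 penalty interacts with a logistic fit, the sketch below runs plain gradient ascent on a tiny, hypothetical, linearly separable dataset. This is not the M-step used here; `fit_logistic`, the data, and the penalty values are all assumptions chosen for the example. Separable data is one well-known situation in which an unpenalized logistic fit diverges, but whether that is what happened in these fits is not something the sketch can establish.

```python
import numpy as np

def fit_logistic(X, y, l2, steps=2000, lr=0.5):
    # Gradient ascent on the mean log-likelihood minus (l2 / 2) * ||w||^2.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (y - p) / len(y) - l2 * w
        w += lr * grad
    return w

# Toy separable data: first column is an intercept, second is the feature.
X = np.array([[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

penalties = (0.0, 0.01, 0.1, 1.0)
norms = [np.linalg.norm(fit_logistic(X, y, l2)) for l2 in penalties]
for l2, n in zip(penalties, norms):
    print(l2, n)

# Without a penalty the weights keep growing the longer we run.
longer = np.linalg.norm(fit_logistic(X, y, 0.0, steps=4000))
print("l2=0, 4000 steps:", longer)
```

With no penalty the weight norm keeps increasing with the number of steps, while any positive penalty pins it to a finite value that shrinks as the penalty grows. In this toy setting, at least, the size of the fitted weights is determined largely by the choice of penalty.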