Answers to two questions

In the previous post, I posed two questions. I’ll answer the second first.

This question considers what would happen if the response function (any response function) were to depend only on a single latent variable. To use the notation of the previous post, I’d write $p(y = 1 | \vec{z}, \lambda) = \phi(z_\lambda).$ Here I will be a little more general and allow $z$ to be multinomial rather than binomial. The model presented in the previous post can then just be written as $p(y = 1 | \vec{z}, \lambda) = \pi^T z_\lambda,$ where with a slight abuse of notation I’m letting $z_\lambda$ denote an indicator vector.

So it turns out that we can reduce any choice of $\phi$ to this construction by rewriting $p(y = 1 | \vec{z}, \lambda) = \sum_k \phi(k) \delta_k(z_\lambda) = \vec{\phi}^T z_\lambda.$ For example, in the model presented several posts earlier, $\phi(z) = \sigma(\eta^T z + \nu),$ which means that we can represent this in the more general parameterization as $\pi=\sigma(\eta + \nu),$ with $\sigma$ applied elementwise. So this question turns out to be a non-starter: the previous post already answered it.
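As a quick sanity check of this reduction, here is a minimal sketch. The values of `eta` and `nu` are made up for illustration and are not taken from any of the posts; the function names are hypothetical as well. It checks that the logistic response and the linear response $\pi^T z_\lambda$ agree whenever $z_\lambda$ is an indicator vector.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical parameters, chosen only for illustration.
eta = np.array([-1.0, 0.5, 2.0])
nu = -0.5

# The equivalent parameter vector in the general formulation.
pi = sigmoid(eta + nu)

def response_logistic(z):
    """phi(z) = sigma(eta^T z + nu), for an indicator vector z."""
    return sigmoid(eta @ z + nu)

def response_linear(z):
    """The same response written as pi^T z."""
    return pi @ z

# The two responses agree on every indicator (one-hot) vector.
for k in range(len(eta)):
    z = np.eye(len(eta))[k]
    print(k, response_logistic(z), response_linear(z))
```

The agreement only holds for indicator vectors; once $z$ is replaced by an average such as $\bar{z},$ the two forms are no longer the same function.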

The other question asks what would happen if we took the previous post’s model, but made the response a function of $\bar{z}$ instead of $z_\lambda.$ In other words, $p(y = 1 | \vec{z}) = \pi^T \bar{z}.$ This looks pretty similar (in expectation) to the previous model except that $\bar{z}$ is part of the model now rather than just a result of an expectation. So, if we want to compute $\mathbb{E}_z[p(y = 1| \vec{z})],$ it works out to be the same. But if we want to compute $\mathbb{E}_z[\log p(y = 1 | \vec{z})],$ it’s different. In terms of the original formulation, we are moving the logarithm across the expectation with respect to $z,$ but NOT across the expectation with respect to $\lambda.$
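To spell out why the two expectations behave differently: the response is linear in $\bar{z},$ so the expectation passes straight through it, while the logarithm is concave, so Jensen's inequality only gives a bound. The inequality below is a standard fact rather than something derived in these posts.

$$\mathbb{E}_z[\pi^T \bar{z}] = \pi^T \mathbb{E}_z[\bar{z}], \qquad \mathbb{E}_z[\log \pi^T \bar{z}] \le \log \pi^T \mathbb{E}_z[\bar{z}].$$

The gap between the two sides of the inequality is the error incurred by moving the log inside the expectation.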

Moved this way, the expectation has no simple closed form, so I can’t give an analytic expression for the error incurred by moving the log inside the expectation. However, I can estimate it empirically by sampling $\bar{z},$ where each $z_i$ is drawn according to a binomial distribution parameterized by $\theta.$ I then compute $\log p(y = 1 | \vec{z})$ for each sampled $\bar{z}$ and take the mean over all samples (in this case, 100 samples were used for every point). I compare this value to the one obtained by computing $\log p(y = 1 | \vec{z})$ at the mean of $\bar{z}$ over these same samples. (The log matters here: because the response is linear in $\bar{z},$ the two quantities would coincide exactly without it.) The difference between the two is the error I plot in the attached figure as a function of $\theta.$
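The sampling procedure can be sketched roughly as follows. This is not the code behind the figure. It assumes the binary case, so that $\bar{z}$ is the fraction of ones among $N$ draws and $p(y = 1 | \vec{z}) = \pi_1 \bar{z} + \pi_0 (1 - \bar{z}),$ and it applies the comparison to the log-likelihood. The function names, the random seed, the grid of $\theta$ values and the choice `n=10` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_response(z_bar, pi1, pi0):
    """log p(y = 1 | z_bar) for the binary case (assumed form)."""
    return np.log(pi1 * z_bar + pi0 * (1.0 - z_bar))

def estimate_error(theta, n, pi1, pi0, num_samples=100, rng=None):
    """Monte Carlo estimate of log p(mean z_bar) - mean log p(z_bar)."""
    rng = np.random.default_rng() if rng is None else rng
    # Each z_bar is the mean of n Bernoulli(theta) draws.
    z_bar = rng.binomial(n, theta, size=num_samples) / n
    mean_of_log = log_response(z_bar, pi1, pi0).mean()
    log_at_mean = log_response(z_bar.mean(), pi1, pi0)
    return log_at_mean - mean_of_log

rng = np.random.default_rng(0)  # arbitrary seed
pi1, pi0 = sigmoid(4.0), sigmoid(-3.0)
thetas = np.linspace(0.05, 0.95, 19)
errors = [estimate_error(t, n=10, pi1=pi1, pi0=pi0, rng=rng) for t in thetas]

for t, e in zip(thetas, errors):
    print(f"theta={t:.2f}  error={e:.4f}")
```

Because both quantities are computed from the same samples, the estimated error is never negative (Jensen's inequality applies to the empirical distribution), though with only 100 samples per point the curve will be noisy.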

Estimating the loss incurred by moving the log into the expectation.

I produce three series for different values of $N,$ the number of draws from the binomial distribution used to create $\bar{z}.$ I chose values of $\pi_1 = \sigma(4), \pi_0 = \sigma(-3).$ This corresponds to the red line in the previous post. One thing that is evident is that the error is much smaller than in the previous post. You can think of the previous post’s curve as the $N=1$ case. As $N$ increases, the variance of $\bar{z}$ decreases and the error drops. This is rather comforting. However, for reasonable values of $N,$ the errors are still rather large: 0.10 may not seem like a lot in log likelihood, but that’s a lot larger than the difference between most techniques!
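The dependence on $N$ can also be checked without sampling. In the binary case $\bar{z}$ takes only the values $k/N$ with binomial probabilities, so the expectation is a finite sum. That is not a closed form, but it is exact. The sketch below is a cross-check under that binary assumption, not the method used for the figure; the function name, the values of $N$ and the choice $\theta = 0.5$ are picked for illustration.

```python
import numpy as np
from math import comb

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def exact_error(theta, n, pi1, pi0):
    """log p(E[z_bar]) - E[log p(z_bar)], summed over k ~ Binomial(n, theta)."""
    k = np.arange(n + 1)
    coeffs = np.array([comb(n, int(j)) for j in k], dtype=float)
    pmf = coeffs * theta**k * (1.0 - theta)**(n - k)
    z_bar = k / n
    expected_log = np.sum(pmf * np.log(pi1 * z_bar + pi0 * (1.0 - z_bar)))
    log_at_mean = np.log(pi1 * theta + pi0 * (1.0 - theta))
    return log_at_mean - expected_log

pi1, pi0 = sigmoid(4.0), sigmoid(-3.0)
for n in (1, 5, 25):  # illustrative values of N
    print(n, exact_error(0.5, n, pi1, pi0))
```

With $N=1$ this reduces to the gap between $\log(\theta \pi_1 + (1-\theta)\pi_0)$ and $\theta \log \pi_1 + (1-\theta) \log \pi_0,$ and the error shrinks as $N$ grows, in line with the trend described above.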