Regularized Logistic Regression Gradients

In the previous post, I showed how to compute the approximate objective function for regularization. To prototype it, I implemented a harness in R with some real data from one of my runs. Directly using nlm seems to work in fits and starts, perhaps because the optimization routine does not use gradients. The next step, then, is to compute the gradients. The gradient of the zeroth-order term is \sigma(\mu) (\vec{p}^+ \circ \vec{p}^+), where \mu = \eta^T \vec{p} \circ \vec{p} + \nu and \vec{p}^+ = \langle \vec{p}, 1\rangle.
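To sanity-check this, here is a minimal sketch in Python (the actual harness is in R) that computes the zeroth-order term and its gradient with respect to (\eta, \nu), and verifies the gradient against central differences; the numbers are made up.

```python
import math

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

def A(t):  # log-partition function log(1 + e^t)
    return math.log1p(math.exp(t))

def zeroth_order(eta, nu, p):
    # mu = eta^T (p o p) + nu
    mu = sum(e * pi * pi for e, pi in zip(eta, p)) + nu
    # gradient wrt (eta, nu) is sigma(mu) * (p+ o p+), with p+ = <p, 1>
    grad = [sigma(mu) * pi * pi for pi in p] + [sigma(mu)]
    return A(mu), grad

eta, nu, p = [0.5, -1.0, 2.0], 0.3, [0.2, 0.5, 0.3]
_, grad = zeroth_order(eta, nu, p)

# central-difference check on every coordinate of (eta, nu)
h = 1e-6
for k in range(len(eta)):
    ep, em = list(eta), list(eta)
    ep[k] += h
    em[k] -= h
    fd = (zeroth_order(ep, nu, p)[0] - zeroth_order(em, nu, p)[0]) / (2 * h)
    assert abs(fd - grad[k]) < 1e-6
fd_nu = (zeroth_order(eta, nu + h, p)[0] - zeroth_order(eta, nu - h, p)[0]) / (2 * h)
assert abs(fd_nu - grad[-1]) < 1e-6
```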

The second-order term is where it gets hard. Applying the product rule, we get \frac{1}{2} A''(\mu) \nabla \mathrm{Var}(x) + \frac{1}{2}\mathrm{Var}(x)A'''(\mu)(\vec{p}^+ \circ \vec{p}^+). To compute \nabla \mathrm{Var}(x), we use the derivation in the previous post to show that this is equal to 2 \eta \circ (V - \mathrm{Diag}(C)) + 2 C \eta.
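Given how erratic the optimization has been, a generic finite-difference gradient checker is worth having around. A small Python sketch (the helper names are mine, and the toy objective at the bottom is just for illustration):

```python
import math

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of f at x (x is a list of floats)."""
    g = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += h
        xm[k] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def check_grad(f, grad_f, x):
    """Max absolute gap between the analytic and numeric gradients at x."""
    return max(abs(a - b) for a, b in zip(grad_f(x), num_grad(f, x)))

# demo on a toy quadratic-plus-logistic objective
f = lambda x: x[0] ** 2 + math.log1p(math.exp(x[1]))
grad_f = lambda x: [2 * x[0], 1.0 / (1.0 + math.exp(-x[1]))]
assert check_grad(f, grad_f, [0.7, -0.4]) < 1e-6
```

Running a check like this on the second-order gradient before handing it to nlm would distinguish a derivation bug from an optimizer problem.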

I implemented all this in R, and it seems to converge erratically. Furthermore, the “converged” values perform worse than not fitting at all. To summarize the results:

  • Both \psi_e and \psi_\sigma perform best without any M-Step fitting.
  • \psi_\sigma > \psi_e when the first is unfit but the second is fit.
  • \psi_e > \psi_\sigma when both are fit or both are unfit.

This is all very peculiar. To ameliorate the fitting process, I collapsed the parameter space so that \vec{\eta} = \langle \eta, \ldots, \eta\rangle, leaving only two parameters. This seems to converge OK and gives results comparable to not fitting for \psi_\sigma. When I do the same parameter collapsing for \psi_e, it performs worse.

In summary, not fitting still seems to be the best across the board.


Regularized Logistic Regression

Currently there are two regularization penalties, and this is sort of a hack. Ideally, we’d want to stick with one that is consistent across methods. This involves simulating additional non-links. In other words, we want to add \log p(y = 0 | \psi_1) + \log p(y = 0 | \psi_2) + \ldots + \log p(y = 0 | \psi_n) to the likelihood we wish to optimize, where each \psi is drawn from a distribution of our choosing. Note that in expectation this is n \mathbb{E}[\log p(y = 0 | \psi)] = -n \mathbb{E}[A(\eta^T \psi + \nu)], where A(x) = \log(1 + \exp(x)). This is intractable to compute exactly, so we use the same old Taylor trick.
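As a sanity check on the simulated-non-link penalty, here is a Monte Carlo sketch in Python (the real code is in R, and the \eta, \nu, \alpha, and \vec{p} below are made up), drawing \psi = z_1 \circ z_2 with each z Dirichlet-distributed:

```python
import math
import random

random.seed(0)

def A(t):  # log(1 + e^t)
    return math.log1p(math.exp(t))

def rdirichlet(alphas):
    # standard Gamma-normalization construction of a Dirichlet draw
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def mc_penalty(eta, nu, p, alpha, n_samples=2000):
    """Monte Carlo estimate of E[log p(y=0 | psi)] = -E[A(eta^T psi + nu)]
    with psi = z1 o z2 and z1, z2 ~ Dir(alpha * p) drawn independently."""
    ap = [alpha * pi for pi in p]
    total = 0.0
    for _ in range(n_samples):
        z1, z2 = rdirichlet(ap), rdirichlet(ap)
        x = sum(e * a * b for e, a, b in zip(eta, z1, z2)) + nu
        total += -A(x)
    return total / n_samples

est = mc_penalty([1.0, -0.5, 2.0], 0.1, [0.3, 0.3, 0.4], alpha=5.0)
assert est < 0.0  # the log-probability of a non-link is always negative
```

The Taylor approximation developed below can be checked against estimates like this one.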

This is very important: first-order is NOT enough. To see why, note that a first-order approximation effectively says you can replace a set of points with their mean. If you did this in logistic regression for all the 0’s and all the 1’s, you’d get a single point for the 0’s and a single point for the 1’s, i.e., complete separability, and therefore your estimates will diverge.
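Here is a toy Python illustration of that divergence: gradient ascent on the logistic log-likelihood of a perfectly separable two-point dataset never converges, because the weight just keeps growing.

```python
import math

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

# two "mean" points: x = -1 labeled 0 and x = +1 labeled 1 -> perfectly separable
data = [(-1.0, 0), (1.0, 1)]

w, lr = 0.0, 0.5
history = []
for _ in range(2000):
    # gradient of the log-likelihood: sum (y - sigma(w x)) x
    grad = sum((y - sigma(w * x)) * x for x, y in data)
    w += lr * grad
    history.append(w)

# the MLE does not exist: w increases without bound instead of settling
assert history[10] < history[len(history) // 2] < history[-1]
```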

So what are the ingredients of a second-order approximation? We first need the derivatives of the partition function: A'(x) = \sigma(x), A''(x) = \sigma(x) \sigma(-x), A'''(x) = \sigma(x) \sigma(-x) (1 - 2\sigma(x)). The other thing we need is the variance of x = \eta^T \psi + \nu. Since variance is shift invariant, we can just compute the variance of x = \eta^T \psi.
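These derivative formulas are easy to verify numerically; a quick Python check against central differences:

```python
import math

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

def A(t):  # partition function log(1 + e^t)
    return math.log1p(math.exp(t))

A1 = lambda t: sigma(t)                                  # A'
A2 = lambda t: sigma(t) * sigma(-t)                      # A''
A3 = lambda t: sigma(t) * sigma(-t) * (1 - 2 * sigma(t)) # A'''

h = 1e-5
for t in (-2.0, -0.5, 0.0, 0.7, 3.0):
    assert abs((A(t + h) - A(t - h)) / (2 * h) - A1(t)) < 1e-8
    assert abs((A1(t + h) - A1(t - h)) / (2 * h) - A2(t)) < 1e-8
    assert abs((A2(t + h) - A2(t - h)) / (2 * h) - A3(t)) < 1e-8
```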

We can expand this as \mathrm{Var}(x) = \sum_i \sum_j \eta_i \eta_j \mathrm{Cov}(\psi_i, \psi_j). The covariance has two forms, one for when i = j and one for when i \ne j. For convenience, we will denote the former V_i and the latter C_{ij}. Then this can be rewritten as \sum_i \eta_i^2 (V_i - C_{ii}) + \sum_i \sum_j \eta_i \eta_j C_{ij}.
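A quick numerical check of this rewriting, with made-up \eta, V, and a (symmetric) C:

```python
# check: sum_ij eta_i eta_j Cov(psi_i, psi_j), with V_i on the diagonal,
# equals sum_i eta_i^2 (V_i - C_ii) + sum_ij eta_i eta_j C_ij
eta = [0.5, -1.0, 2.0]
V = [0.30, 0.20, 0.10]
C = [[0.05, -0.01, 0.02],
     [-0.01, 0.04, -0.03],
     [0.02, -0.03, 0.06]]
n = len(eta)

direct = sum(eta[i] * eta[j] * (V[i] if i == j else C[i][j])
             for i in range(n) for j in range(n))
rewritten = (sum(eta[i] ** 2 * (V[i] - C[i][i]) for i in range(n))
             + sum(eta[i] * eta[j] * C[i][j]
                   for i in range(n) for j in range(n)))
assert abs(direct - rewritten) < 1e-12
```

The rewriting just adds and subtracts the diagonal of C so that the double sum runs over all pairs.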

At this point we need to make an assumption about the distribution of \psi. We will assume that \psi = z_1 \circ z_2, where z_1, z_2 \sim \mathrm{Dir}(\alpha \vec{p}) independently. Then C_{ij} = \mathbb{E}[z_{i} z_{j}]^2 - p_i^2 p_j^2, and \mathbb{E}[z_i z_j] = \mathrm{Cov}(z_i, z_j) + p_i p_j = -p_i p_j \frac{1}{\alpha + 1} + p_i p_j = \frac{\alpha}{\alpha + 1} p_i p_j. Consequently, C_{ij} = -\frac{2\alpha + 1}{(\alpha+1)^2} (p_ip_j)^2.

On to the next term: V_i = \mathbb{E}[z_i^2]^2 - p_{i}^4.  Using common properties of the Dirichlet, \mathbb{E}[z_{i}^2] = \mathrm{Var}(z_i) + p_{i}^2 = \frac{p_i (1 - p_i)}{\alpha + 1} + p_{i}^2 = \frac{p_i (1 + \alpha p_i)}{\alpha + 1}. This yields V_i = p_i^2 \frac{1 + 2 \alpha p_i }{(\alpha + 1)^2} + C_{ii}.

Finally, notice that \sum_i \sum_j \eta_i \eta_j C_{ij} = -\frac{2\alpha + 1}{(\alpha + 1)^2} (\eta^T (\vec{p} \circ \vec{p}))^2.
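All three closed forms are easy to check numerically; a small Python sketch with made-up \alpha, \vec{p}, and \eta:

```python
alpha = 5.0
p = [0.2, 0.5, 0.3]
eta = [1.0, -0.5, 2.0]
n = len(p)

def C(i, j):
    # closed form -(2 alpha + 1)/(alpha + 1)^2 (p_i p_j)^2
    # (also used as the analytic extension at i == j)
    return -(2 * alpha + 1) / (alpha + 1) ** 2 * (p[i] * p[j]) ** 2

# C_ij = E[z_i z_j]^2 - (p_i p_j)^2 with E[z_i z_j] = alpha/(alpha+1) p_i p_j
for i in range(n):
    for j in range(n):
        if i != j:
            Ezz = alpha / (alpha + 1) * p[i] * p[j]
            assert abs(Ezz ** 2 - (p[i] * p[j]) ** 2 - C(i, j)) < 1e-12

# V_i = E[z_i^2]^2 - p_i^4 with E[z_i^2] = p_i (1 + alpha p_i)/(alpha + 1)
for i in range(n):
    Ez2 = p[i] * (1 + alpha * p[i]) / (alpha + 1)
    V_closed = p[i] ** 2 * (1 + 2 * alpha * p[i]) / (alpha + 1) ** 2 + C(i, i)
    assert abs(Ez2 ** 2 - p[i] ** 4 - V_closed) < 1e-12

# the double sum collapses to a single inner product
double_sum = sum(eta[i] * eta[j] * C(i, j) for i in range(n) for j in range(n))
ip = sum(eta[i] * p[i] ** 2 for i in range(n))
assert abs(double_sum + (2 * alpha + 1) / (alpha + 1) ** 2 * ip ** 2) < 1e-12
```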


What’s the deal with the logistic response?

Below I asked why an approximation which is more accurate seems to perform worse than one which is more haphazard. The two methods differ in two principal ways:

  • The approximate gradient computed during the E-step.  
  • The optimization of the parameters in the M-step.

In the E-step they differ mainly in how they push a node’s topic distribution toward its neighbors’. With the simple but incorrect method, the gradient is proportional to \eta^T z', while the correct method gives \eta^T z'(1 - \sigma(\eta^T z \circ z' + \nu)). Intuitively, the latter tells us that the closer we are to the right answer, the less we need to push.
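A toy Python illustration of that damping (all numbers made up): the simple push is the same whether or not the link is already well explained, while the damped push shrinks as \sigma(x) approaches 1.

```python
import math

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

def pushes(eta, nu, z, zp):
    """Illustrative E-step pushes toward a linked neighbor's topics zp:
    the simple method (eta o z') vs. the damped (eta o z')(1 - sigma(x))."""
    x = sum(e * a * b for e, a, b in zip(eta, z, zp)) + nu
    simple = [e * b for e, b in zip(eta, zp)]
    damped = [s * (1.0 - sigma(x)) for s in simple]
    return simple, damped

eta, nu = [3.0, 1.0], 0.0
zp = [0.8, 0.2]                               # the neighbor's topics
_, d_far = pushes(eta, nu, [0.1, 0.9], zp)    # poorly explained link: big push
_, d_near = pushes(eta, nu, [0.8, 0.2], zp)   # well explained link: small push
assert abs(d_near[0]) < abs(d_far[0])
```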

I futzed with the code to make it push more like the incorrect method, but this didn’t seem to affect the results. I suspect that \eta^T z \circ z' + \nu is always pretty small, so this doesn’t have much of an impact. Then I tried changing the M-step. In particular, I tried removing the M-step fitting with \psi_\sigma. It turns out that this “fit” performs about as well as \psi_e. Examining the fits shows that \eta under \psi_\sigma is consistently smaller than under \psi_e. Why is this? I hypothesize that the L2 regularization penalty is causing this to fail. Having to determine the optimal penalty is a pain; I originally added this term because the fits were diverging. Perhaps the right thing to do is figure out why they were diverging in the first place.


Reproducing experiments and rediscovering alpha

It is a good idea to always make experiments exactly reproducible. Sometimes, in the heat of a paper deadline, a bunch of experiments are run with different parameter settings and whatnot, and a few months down the road you might be hard pressed to remember exactly how you got those numbers. I’ve learned this the hard way, so for the last few submissions I’ve tried to make sure something in the CVS encodes the precise settings necessary.

Last night, while looking up what I did, I rediscovered that alpha is sometimes important. Not so much because of scale, but because of the distribution over topics it reports. My experience is that fitting alpha in an EM context rarely works; it either totally messes up your inference or it just reproduces the alpha that you fed in initially. But I did get into the habit of having the model fit and spit out alpha after the very last iteration of EM, just to get some idea of what the topic distributions look like.

Usually you get something which is pretty uniform over your K topics.  But sometimes you don’t.  In old-fashioned LDA, this usually happens when you have too many topics.  With models such as RTM, you can also get this effect because your graph manifold may have cliques of different sizes.  In these cases, when you’re computing predictive word likelihood of an empty document, you are liable to perform much worse than baseline if you assume a symmetric Dirichlet prior rather than one which reflects the topic distribution on your training set.
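A toy Python sketch of that effect (made-up topics and counts): scoring held-out words from an empty document under the prior-mean topic mixture, a prior that reflects the skewed training usage beats the symmetric one.

```python
import math

# two topics over a two-word vocabulary (made-up numbers)
beta = [[0.9, 0.1],   # topic 0 favors word 0
        [0.1, 0.9]]   # topic 1 favors word 1

def empty_doc_loglik(prior, words):
    """log p(words) for an empty document: each word is drawn from the
    prior-mean mixture sum_k (alpha_k / sum_k' alpha_k') beta_kw."""
    s = sum(prior)
    theta = [a / s for a in prior]
    return sum(math.log(sum(t * b[w] for t, b in zip(theta, beta)))
               for w in words)

# held-out words generated mostly by topic 0, mimicking skewed topic usage
words = [0] * 9 + [1]
sym = empty_doc_loglik([1.0, 1.0], words)  # symmetric Dirichlet prior
fit = empty_doc_loglik([9.0, 1.0], words)  # prior reflecting training usage
assert fit > sym
```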


Approximating the logistic response

The central challenge of variational methods is usually computing expectations of log probabilities. In the case of the RTM, this is \mathbb{E}[\log p(y | z, z')] = y \mathbb{E}[x] - \mathbb{E}[\log(1 + \exp(x))], where x = \eta^T z \circ z' + \nu.

The first term is linear and so is easy enough; the second is problematic. One approach is to use a Taylor approximation. The issue then becomes choosing the point around which to center the approximation. The partition function above really has two regimes: for small x, \log(1 + \exp(x)) \approx 0, but for large x, \log(1 + \exp(x)) \approx x. The solution the delta method uses is to center it at the mean \mu = \mathbb{E}[x]. But does this give us any real guarantee that we won’t be better off centering it elsewhere?

I couldn’t really answer this question analytically, so I decided to experiment.  I sampled x using settings typical of the corpora I look at.  Turns out that the first order approximation at the mean is really good because the variance on z is really low when you have enough words.
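Here is a sketch of that experiment in Python (the actual settings came from my corpora; the \mu and \sigma below are made up): when the variance is low, both the first-order approximation at the mean and the delta-method correction land close to a Monte Carlo estimate.

```python
import math
import random

random.seed(42)

def A(t):  # log(1 + e^t)
    return math.log1p(math.exp(t))

mu, sd = -1.0, 0.05          # low variance, as when z averages many words
xs = [random.gauss(mu, sd) for _ in range(20000)]
mc = sum(A(x) for x in xs) / len(xs)   # Monte Carlo E[A(x)]

first_order = A(mu)                    # Taylor at the mean, zeroth/first order
s = 1.0 / (1.0 + math.exp(-mu))
second_order = A(mu) + 0.5 * sd ** 2 * s * (1 - s)   # delta-method correction

assert abs(mc - first_order) < 5e-3
assert abs(mc - second_order) < 5e-3
```

Rerunning with a larger sd is an easy way to see where the first-order version starts to break down.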

That of course brings up another question.  Why does doing the “correct” (\psi_\sigma) thing not work as well as the “incorrect” (\psi_e) approximation?
