# Daily Archives: November 23, 2008

## Regularized Logistic Regression Gradients

In the previous post, I showed how to compute the approximate objective function for regularization.  I implemented such a harness in R with some real data from one of my runs to prototype it.  Directly using nlm seems to work in fits and starts, perhaps because the optimization routine does not use gradients.  The next step then is to compute the gradients.   The gradient of the zero’th order term is $\sigma(\mu) (\vec{p}^+ \circ \vec{p}^+),$ where $\mu = \eta^T \vec{p} \circ \vec{p} + \nu$ and $\vec{p}^+ = \langle \vec{p}, 1\rangle$.

The second order term is where it gets hard.  Taking the derivative by parts, we get $\frac{1}{2} A''(\mu) \nabla \mathrm{Var}(x) + \frac{1}{2}\mathrm{Var}(x)A'''(\mu)(\vec{p}^+ \circ \vec{p}^+)$. To compute $\nabla \mathrm{Var}(x)$, we use the derivation in the previous post to show that this is equal to $2 \eta \circ (V - \mathrm{Diag}(C)) + C \eta + \eta \circ \mathrm{Diag}(C)$.

I implement all this in R and it seems to converge erratically.  Further more, the “converged” values perform worse than the not fitting at all.   So to summarize results:

• Both $\psi_e$ and $\psi_\sigma$ perform best without any M-Step fitting.
• $\psi_\sigma > \psi_e$ when the first is unfit but the second is fit.
• $\psi_e > \psi_\sigma$ when both are fit or both are unfit.

This is all very peculiar.  In order to ameliorate the fitting process, I collapsed the parameter space so that $\vec{\eta} = \langle \eta, \ldots, \eta\rangle$ so that there are two parameters.   This seems to converge ok, and give results comparable to without fitting for $\psi_\sigma$.  When I do the similar parameter collapsing in $\psi_e$ it performs worse.

In summary, it seems like across the board not fitting is still the best.

Filed under Uncategorized

## Regularized Logistic Regression

Currently there are two regularization penalties, and this is sort of a hack.  Ideally, we’d want to stick with one that is consistent across methods.  This involves simulating additional non-links.   In other words, we want to add $\log p(y = 0 | \psi_1) + \log p(y = 0 | \psi_2) + \ldots + \log p(y = 0 | \psi_n)$ to the likelihood we wish to optimize, where $\psi$ is drawn from a distribution of our choosing.  Note that this becomes equivalent to $\mathbb{E}[\log p(y = 0 | \psi)] = -\mathbb{E}[A(\eta^T \psi + nu)]$, where $A(x) = \log(1 + \exp(x))$.   This is intractable to do exactly so we use the same old Taylor trick.

This is very important: first-order is NOT enough.  To see why this is, consider that what a first order approximation really says is that you can replace a set of points with their mean.  Well, if you did this in logistic regression for all the 0’s and all the 1’s you’d get a single point for the 0’s and a single point for the 1’s, i.e., complete separability.  And therefore your estimates will diverge.

So what are the ingredients to a second-order approximation?  We first need to compute the gradients of the partition function: $A'(x) = \sigma(x)$, $A''(x) = \sigma(x) \sigma(-x)$, $A'''(x) = \sigma(x) \sigma(-x) (1 - 2\sigma(x))$.  The other thing we need is the variance of $x = \eta^T \psi + \nu$.  Note that since variance is shift invariant, we can just compute the variance of $x = \eta^T \psi$.

We can expand this by $\mathrm{Var}(x) = \sum_i \sum_j \eta_i \eta_j \mathrm{Cov}(\psi_i, \psi_j)$.  Typically, covariance has two forms, one for when $i=j$ and when for when $i \ne j$.  For convenience, we will denote the first $V_i$ and the latter $C_{ij}$.  Then this can be rewritten as $\sum_i \eta_i^2 (V_i - C_{ii}) + \sum_i \sum_j \eta_i \eta_j C_{ij}$.

At this point we need to make about the distribution of $\psi$.  We will assume that $\psi = z_1 \circ z_2$ where $z \sim \mathrm{Dir}(\alpha \vec{p})$. $C_{ij} = \mathbb{E}[z_{i} z_{j}]^2- p_i^2p_j^2$. $\mathbb{E}[z_i z_j ] = \mathrm{Cov}(z_i, z_j) + p_i p_j = -p_i p_j \frac{1}{\alpha + 1} + p_i p_j = \frac{\alpha}{\alpha + 1} p_i p_j.$  Consequently, $C_{ij} = -\frac{2\alpha + 1}{(\alpha+1)^2} (p_ip_j)^2$.

On to the next term: $V_i = \mathbb{E}[z_i^2]^2 - p_{i}^4$.  Using common properties of the Dirichlet, $\mathbb{E}[z_{i}^2] = \mathrm{Var}(z_i) + p_{i}^2 = \frac{p_i (1 - p_i)}{\alpha + 1} + p_{i}^2 = \frac{p_i (1 + \alpha p_i)}{\alpha + 1}$. This yields $V_i = p_i^2 \frac{1 + 2 \alpha p_i }{(\alpha + 1)^2} + C_{ii}$.

Finally, notice that $\sum_i \sum_j \eta_i \eta_j C_{ij} = -\frac{2\alpha + 1}{(\alpha + 1)^2} (\eta^T (\vec{p} \circ \vec{p}))^2$.