Regularized Logistic Regression

Currently there are two regularization penalties, which is something of a hack; ideally we'd stick with a single penalty that is consistent across methods.  Doing so involves simulating additional non-links.  In other words, we want to add \log p(y = 0 | \psi_1) + \log p(y = 0 | \psi_2) + \ldots + \log p(y = 0 | \psi_n) to the likelihood we wish to optimize, where each \psi_i is drawn from a distribution of our choosing.  Note that, per draw, this is equivalent to \mathbb{E}[\log p(y = 0 | \psi)] = -\mathbb{E}[A(\eta^T \psi + \nu)], where A(x) = \log(1 + \exp(x)).  Computing this expectation exactly is intractable, so we use the same old Taylor trick.
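As a quick sanity check of that identity (a numpy sketch; \eta, \psi, and \nu below are arbitrary illustrative values, not part of the derivation): since p(y = 1 | \psi) = \sigma(\eta^T \psi + \nu), it follows that \log p(y = 0 | \psi) = -A(\eta^T \psi + \nu).

```python
import numpy as np

# Sanity check: with p(y = 1 | psi) = sigmoid(eta^T psi + nu),
# log p(y = 0 | psi) = -A(eta^T psi + nu), where A(x) = log(1 + exp(x)).
# eta, psi, nu below are arbitrary illustrative values.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def A(x):  # log partition function, a.k.a. softplus
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
eta, psi, nu = rng.normal(size=5), rng.normal(size=5), 0.3
x = eta @ psi + nu

log_p_y0 = np.log(1.0 - sigmoid(x))   # log p(y = 0 | psi)
assert np.isclose(log_p_y0, -A(x))
```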

This is very important: first-order is NOT enough.  To see why, note that a first-order approximation effectively replaces a set of points with its mean.  If you did this in logistic regression for all the 0's and all the 1's, you'd get a single point for the 0's and a single point for the 1's, i.e., complete separability, and your estimates would therefore diverge.

So what are the ingredients of a second-order approximation?  We first need the derivatives of the log partition function: A'(x) = \sigma(x), A''(x) = \sigma(x) \sigma(-x), A'''(x) = \sigma(x) \sigma(-x) (1 - 2\sigma(x)), where \sigma is the logistic sigmoid.  The other thing we need is the variance of x = \eta^T \psi + \nu.  Since variance is shift invariant, it suffices to compute the variance of \eta^T \psi.
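Concretely, the second-order approximation is \mathbb{E}[A(x)] \approx A(\mu) + \frac{1}{2} A''(\mu) \mathrm{Var}(x), with \mu = \mathbb{E}[x].  A quick Monte Carlo sketch of this (using a Gaussian x purely for illustration; \mu and the variance are arbitrary test values):

```python
import numpy as np

# Second-order Taylor approximation of E[A(x)] around mu = E[x]:
#   E[A(x)] ~= A(mu) + 0.5 * A''(mu) * Var(x),  A''(x) = sigmoid(x) sigmoid(-x).
# Checked against Monte Carlo with a Gaussian x (illustrative choice only).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def A(x):
    return np.log1p(np.exp(x))

mu, var = 0.4, 0.25
rng = np.random.default_rng(1)
samples = rng.normal(mu, np.sqrt(var), size=1_000_000)

mc = A(samples).mean()
taylor2 = A(mu) + 0.5 * sigmoid(mu) * sigmoid(-mu) * var
assert abs(mc - taylor2) < 1e-2
```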

We can expand this as \mathrm{Var}(x) = \sum_i \sum_j \eta_i \eta_j \mathrm{Cov}(\psi_i, \psi_j).  The covariance takes two forms, one for i = j and one for i \ne j; for convenience, we will denote the former V_i and the latter C_{ij}.  The sum can then be rewritten as \sum_i \eta_i^2 (V_i - C_{ii}) + \sum_i \sum_j \eta_i \eta_j C_{ij}, where C_{ii} is the off-diagonal formula evaluated at i = j.
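One thing worth noting: the C_{ii} term cancels between the two sums, so the rewrite holds for any choice of diagonal C_{ii}.  A quick numerical check (all values random/arbitrary, nothing model-specific):

```python
import numpy as np

# The rewrite Var(x) = sum_i eta_i^2 (V_i - C_ii) + sum_ij eta_i eta_j C_ij
# holds for ANY diagonal C_ii, since it cancels between the two sums.
# All values below are random/arbitrary.

rng = np.random.default_rng(2)
n = 4
eta = rng.normal(size=n)
M = rng.normal(size=(n, n))
cov = M @ M.T                            # a valid covariance matrix
V = np.diag(cov).copy()                  # V_i = Var(psi_i)

C = cov.copy()
np.fill_diagonal(C, rng.normal(size=n))  # arbitrary diagonal C_ii

lhs = eta @ cov @ eta                    # Var(x) = sum_ij eta_i eta_j Cov(psi_i, psi_j)
rhs = np.sum(eta**2 * (V - np.diag(C))) + eta @ C @ eta
assert np.isclose(lhs, rhs)
```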

At this point we need to make an assumption about the distribution of \psi.  We will assume that \psi = z_1 \circ z_2, where z_1 and z_2 are independent draws from \mathrm{Dir}(\alpha \vec{p}).  By independence, \mathbb{E}[\psi_i \psi_j] = \mathbb{E}[z_i z_j]^2, so C_{ij} = \mathbb{E}[z_i z_j]^2 - p_i^2 p_j^2.  Now, \mathbb{E}[z_i z_j] = \mathrm{Cov}(z_i, z_j) + p_i p_j = -\frac{p_i p_j}{\alpha + 1} + p_i p_j = \frac{\alpha}{\alpha + 1} p_i p_j.  Consequently, C_{ij} = -\frac{2\alpha + 1}{(\alpha+1)^2} (p_i p_j)^2.
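A Monte Carlo sanity check of this C_{ij} formula (numpy sketch; \alpha and \vec{p} are arbitrary test values):

```python
import numpy as np

# Monte Carlo check of C_ij = -(2a + 1)/(a + 1)^2 * (p_i p_j)^2 for i != j,
# where psi = z1 * z2 elementwise and z1, z2 are iid Dirichlet(a * p).
# alpha and p are arbitrary test values.

rng = np.random.default_rng(3)
alpha = 5.0
p = np.array([0.2, 0.3, 0.5])

z1 = rng.dirichlet(alpha * p, size=1_000_000)
z2 = rng.dirichlet(alpha * p, size=1_000_000)
psi = z1 * z2

i, j = 0, 1
mc = np.cov(psi[:, i], psi[:, j])[0, 1]
closed = -(2 * alpha + 1) / (alpha + 1) ** 2 * (p[i] * p[j]) ** 2
assert abs(mc - closed) < 1e-4
```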

On to the diagonal term: V_i = \mathbb{E}[z_i^2]^2 - p_i^4, again using independence of z_1 and z_2.  By standard properties of the Dirichlet, \mathbb{E}[z_i^2] = \mathrm{Var}(z_i) + p_i^2 = \frac{p_i (1 - p_i)}{\alpha + 1} + p_i^2 = \frac{p_i (1 + \alpha p_i)}{\alpha + 1}.  This yields V_i = \frac{p_i^2 (1 + 2 \alpha p_i)}{(\alpha + 1)^2} + C_{ii}.
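And a matching Monte Carlo check for V_i (same Dirichlet-product setup; \alpha and \vec{p} are arbitrary test values):

```python
import numpy as np

# Monte Carlo check of V_i = p_i^2 (1 + 2 a p_i)/(a + 1)^2 + C_ii,
# with C_ii = -(2a + 1)/(a + 1)^2 * p_i^4, for psi_i = z1_i * z2_i.
# alpha and p are arbitrary test values.

rng = np.random.default_rng(4)
alpha = 5.0
p = np.array([0.2, 0.3, 0.5])

z1 = rng.dirichlet(alpha * p, size=1_000_000)
z2 = rng.dirichlet(alpha * p, size=1_000_000)
psi = z1 * z2

i = 1
mc = psi[:, i].var()
C_ii = -(2 * alpha + 1) / (alpha + 1) ** 2 * p[i] ** 4
closed = p[i] ** 2 * (1 + 2 * alpha * p[i]) / (alpha + 1) ** 2 + C_ii
assert abs(mc - closed) < 2e-4
```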

Finally, notice that \sum_i \sum_j \eta_i \eta_j C_{ij} = -\frac{2\alpha + 1}{(\alpha + 1)^2} (\eta^T (\vec{p} \circ \vec{p}))^2.
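Putting the pieces together, since V_i - C_{ii} = \frac{p_i^2 (1 + 2\alpha p_i)}{(\alpha + 1)^2}, the whole variance has the closed form \mathrm{Var}(x) = \frac{1}{(\alpha+1)^2} \left[ \sum_i \eta_i^2 p_i^2 (1 + 2\alpha p_i) - (2\alpha + 1) (\eta^T (\vec{p} \circ \vec{p}))^2 \right].  A Monte Carlo check of the assembled formula (\alpha, \vec{p}, \eta are arbitrary test values):

```python
import numpy as np

# Full closed form assembled from the pieces above:
#   Var(eta^T psi) = [sum_i eta_i^2 p_i^2 (1 + 2 a p_i)
#                     - (2a + 1) * (eta . (p*p))^2] / (a + 1)^2,
# checked against Monte Carlo. alpha, p, eta are arbitrary test values.

rng = np.random.default_rng(5)
alpha = 5.0
p = np.array([0.2, 0.3, 0.5])
eta = np.array([1.0, -2.0, 0.5])

z1 = rng.dirichlet(alpha * p, size=1_000_000)
z2 = rng.dirichlet(alpha * p, size=1_000_000)
x = (z1 * z2) @ eta

closed = (np.sum(eta**2 * p**2 * (1 + 2 * alpha * p))
          - (2 * alpha + 1) * (eta @ (p * p)) ** 2) / (alpha + 1) ** 2
assert abs(x.var() - closed) < 1e-3
```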
