# Regularized Logistic Regression

Currently there are two regularization penalties, and this is sort of a hack.  Ideally, we'd want to stick with one that is consistent across methods.  This involves simulating additional non-links.  In other words, we want to add $\log p(y = 0 | \psi_1) + \log p(y = 0 | \psi_2) + \ldots + \log p(y = 0 | \psi_n)$ to the likelihood we wish to optimize, where each $\psi$ is drawn from a distribution of our choosing.  Note that, per draw, this amounts to adding $\mathbb{E}[\log p(y = 0 | \psi)] = -\mathbb{E}[A(\eta^T \psi + \nu)]$, where $A(x) = \log(1 + \exp(x))$.  This expectation is intractable to compute exactly, so we use the same old Taylor trick.
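As a sketch of the quantity being approximated (function names here are hypothetical, not from the codebase), the Monte Carlo version of the expected non-link penalty would look like:

```python
import numpy as np

def log_partition(x):
    # A(x) = log(1 + exp(x)), computed stably via logaddexp
    return np.logaddexp(0.0, x)

def expected_nonlink_penalty_mc(eta, nu, psi_samples):
    # Monte Carlo estimate of E[A(eta^T psi + nu)], i.e. the negated
    # expected log-probability of a non-link under randomly drawn psi.
    # psi_samples has one draw of psi per row.
    return log_partition(psi_samples @ eta + nu).mean()
```

The closed-form Taylor approximation below replaces this sampling step.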

This is very important: first-order is NOT enough.  To see why, consider that what a first-order approximation really says is that you can replace a set of points with their mean.  If you did this in logistic regression for all the 0's and all the 1's, you'd get a single point for the 0's and a single point for the 1's, i.e., complete separability, and therefore your estimates will diverge.
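A toy illustration of this failure mode, assuming a hypothetical 1-D problem where each class has been collapsed to its mean: plain gradient ascent on the logistic log-likelihood never settles, the weight just keeps growing.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Collapse each class to a single point: perfectly separable data.
x = np.array([-1.0, 1.0])   # class "means" (made-up 1-D example)
y = np.array([0.0, 1.0])

def fit(steps, lr=0.5):
    # Plain gradient ascent on the logistic log-likelihood (no intercept).
    w = 0.0
    for _ in range(steps):
        w += lr * np.sum((y - sigmoid(w * x)) * x)
    return w
```

Here the gradient is $2\sigma(-w) > 0$ for every $w$, so `fit` returns a strictly larger weight the more steps you run: the MLE does not exist.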

So what are the ingredients of a second-order approximation?  We first need the derivatives of the log-partition function: $A'(x) = \sigma(x)$, $A''(x) = \sigma(x) \sigma(-x)$, $A'''(x) = \sigma(x) \sigma(-x) (1 - 2\sigma(x))$.  The other thing we need is the variance of $x = \eta^T \psi + \nu$.  Note that since variance is shift invariant, we can just compute the variance of $\eta^T \psi$.
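These ingredients can be sketched in numpy (names are mine); the last function is the second-order (delta-method) approximation itself, $\mathbb{E}[A(x)] \approx A(\mathbb{E}[x]) + \tfrac{1}{2} A''(\mathbb{E}[x]) \mathrm{Var}(x)$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def A(x):
    # log-partition A(x) = log(1 + exp(x)), computed stably
    return np.logaddexp(0.0, x)

def A1(x):
    return sigmoid(x)                          # A'(x)

def A2(x):
    return sigmoid(x) * sigmoid(-x)            # A''(x)

def A3(x):
    return A2(x) * (1.0 - 2.0 * sigmoid(x))    # A'''(x)

def expected_A_2nd_order(mu, var):
    # Delta method: E[A(x)] ~= A(E[x]) + 0.5 * A''(E[x]) * Var(x)
    return A(mu) + 0.5 * A2(mu) * var
```

Each derivative can be verified against a central finite difference of the one before it.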

We can expand this as $\mathrm{Var}(x) = \sum_i \sum_j \eta_i \eta_j \mathrm{Cov}(\psi_i, \psi_j)$.  The covariance has two forms, one for $i = j$ and one for $i \ne j$.  For convenience, we will denote the former $V_i$ and the latter $C_{ij}$.  Then this can be rewritten as $\sum_i \eta_i^2 (V_i - C_{ii}) + \sum_i \sum_j \eta_i \eta_j C_{ij}$, where $C_{ii}$ denotes the off-diagonal formula evaluated at $i = j$.
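A quick numeric check of this rewrite, using an arbitrary symmetric matrix as a stand-in for the off-diagonal formula and arbitrary variances (all values here are made up for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
eta = rng.normal(size=K)
V = rng.normal(size=K) ** 2      # stand-in diagonal variances V_i
C = rng.normal(size=(K, K))
C = (C + C.T) / 2.0              # stand-in off-diagonal formula, extended to i = j

# Full covariance matrix: V_i on the diagonal, C_ij off it
Sigma = C.copy()
np.fill_diagonal(Sigma, V)

direct = eta @ Sigma @ eta
rewritten = np.sum(eta**2 * (V - np.diag(C))) + eta @ C @ eta
```

The point of the rewrite is that the second term becomes a single closed form once $C_{ij}$ factorizes.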

At this point we need to make an assumption about the distribution of $\psi$.  We will assume that $\psi = z_1 \circ z_2$, where $z_1, z_2 \sim \mathrm{Dir}(\alpha \vec{p})$ independently.  Then $C_{ij} = \mathbb{E}[z_i z_j]^2 - p_i^2 p_j^2$, and $\mathbb{E}[z_i z_j] = \mathrm{Cov}(z_i, z_j) + p_i p_j = -p_i p_j \frac{1}{\alpha + 1} + p_i p_j = \frac{\alpha}{\alpha + 1} p_i p_j.$  Consequently, $C_{ij} = -\frac{2\alpha + 1}{(\alpha+1)^2} (p_i p_j)^2$.
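A sanity check of the $C_{ij}$ algebra at arbitrary example values (any $\alpha > 0$ and valid $p_i, p_j$ would do):

```python
# Arbitrary example values, just to check the algebra numerically
alpha, pi, pj = 3.0, 0.2, 0.5

E_zizj = -pi * pj / (alpha + 1) + pi * pj        # = alpha/(alpha+1) * pi * pj
C_from_moments = E_zizj**2 - (pi * pj)**2        # Cov(psi_i, psi_j), psi = z1 o z2
C_closed_form = -(2 * alpha + 1) / (alpha + 1)**2 * (pi * pj)**2
```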

On to the next term: $V_i = \mathbb{E}[z_i^2]^2 - p_i^4$.  Using standard properties of the Dirichlet, $\mathbb{E}[z_i^2] = \mathrm{Var}(z_i) + p_i^2 = \frac{p_i (1 - p_i)}{\alpha + 1} + p_i^2 = \frac{p_i (1 + \alpha p_i)}{\alpha + 1}$. This yields $V_i = p_i^2 \frac{1 + 2 \alpha p_i}{(\alpha + 1)^2} + C_{ii}$.
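And the same kind of check for the $V_i$ identity (again with made-up example values):

```python
# Arbitrary example values
alpha, p = 2.0, 0.3

E_zi2 = p * (1 - p) / (alpha + 1) + p**2         # = p(1 + alpha p)/(alpha + 1)
V_from_moments = E_zi2**2 - p**4                  # V_i = E[z_i^2]^2 - p_i^4
C_ii = -(2 * alpha + 1) / (alpha + 1)**2 * p**4   # the C_ij formula at i = j
V_closed_form = p**2 * (1 + 2 * alpha * p) / (alpha + 1)**2 + C_ii
```

Writing $V_i$ as "nice term plus $C_{ii}$" is what makes the $V_i - C_{ii}$ difference in the variance decomposition collapse cleanly.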

Finally, notice that $\sum_i \sum_j \eta_i \eta_j C_{ij} = -\frac{2\alpha + 1}{(\alpha + 1)^2} (\eta^T (\vec{p} \circ \vec{p}))^2$.
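Putting the pieces together, the approximate variance of $\eta^T \psi$ has a fully closed form.  A numeric check (with arbitrary example values of $\alpha$, $\vec{p}$, $\eta$) that the assembled closed form matches the term-by-term sum:

```python
import numpy as np

# Arbitrary example values; p must lie on the simplex
alpha = 2.0
p = np.array([0.5, 0.3, 0.2])
eta = np.array([1.0, -2.0, 0.5])

a1 = alpha + 1.0
C = -(2 * alpha + 1) / a1**2 * np.outer(p**2, p**2)   # C_ij, including i = j
V = p**2 * (1 + 2 * alpha * p) / a1**2 + np.diag(C)   # V_i

# Term-by-term sum from the variance decomposition
var_sum = np.sum(eta**2 * (V - np.diag(C))) + eta @ C @ eta

# Fully closed form, using the identity for the double sum over C_ij
var_closed = (np.sum(eta**2 * p**2 * (1 + 2 * alpha * p))
              - (2 * alpha + 1) * (eta @ p**2)**2) / a1**2
```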