Currently there are two regularization penalties, and this is sort of a hack. Ideally, we’d want to stick with a single penalty that is consistent across methods. This involves simulating additional non-links. In other words, we want to add the expected contribution of these simulated non-links to the likelihood we wish to optimize, where each simulated non-link is drawn from a distribution of our choosing. Computing this expectation exactly is intractable, so we use the same old Taylor trick.
This is very important: first-order is NOT enough. To see why, note that a first-order approximation effectively replaces a set of points with their mean. If you did that in logistic regression for all the 0’s and all the 1’s, you’d be left with a single point for the 0’s and a single point for the 1’s, i.e., complete separability, and your estimates would therefore diverge.
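As a concrete sketch of the first- versus second-order distinction, suppose (this is an assumption, since the exact term is not written out above) that each simulated non-link contributes the logistic log partition function $A(\eta) = \log(1 + e^{\eta})$ evaluated at a random linear predictor $\eta$ with mean $\mu = \mathbb{E}[\eta]$. Then

$$
\mathbb{E}[A(\eta)] \;\approx\; A(\mu) \qquad \text{(first order)},
$$
$$
\mathbb{E}[A(\eta)] \;\approx\; A(\mu) + \tfrac{1}{2} A''(\mu)\operatorname{Var}(\eta)
\;=\; A(\mu) + \tfrac{1}{2}\,\sigma(\mu)\bigl(1 - \sigma(\mu)\bigr)\operatorname{Var}(\eta) \qquad \text{(second order)},
$$

where $\sigma$ is the logistic function. The first-order version depends on the simulated points only through their mean, which is exactly the replace-the-points-with-their-mean failure described above; the variance term is what the second-order correction buys us.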
So what are the ingredients of a second-order approximation? First, we need the gradients of the partition function, i.e., its first and second derivatives. Second, we need the variance of its (random) argument. Note that since variance is shift invariant, any constant offset can be dropped, so it suffices to compute the variance of the random part.
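Under the same logistic assumption as above, the needed derivatives are standard, and the shift-invariance fact is generic:

$$
A(\eta) = \log(1 + e^{\eta}), \qquad
A'(\eta) = \sigma(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}, \qquad
A''(\eta) = \sigma(\eta)\bigl(1 - \sigma(\eta)\bigr),
$$
$$
\operatorname{Var}(c + X) = \operatorname{Var}(X) \quad \text{for any constant } c.
$$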
We can expand this variance as a double sum of pairwise covariances. The covariance takes two forms: one for when the two indices are equal and one for when they differ. For convenience, give the first (the variance term) and the latter (the cross term) their own symbols; the expansion can then be rewritten in terms of just these two quantities.
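A sketch of that expansion, using hypothetical notation since the original symbols are not shown ($x_1, \dots, x_n$ for the summands, $\sigma_{=}$ for the equal-index term, $\sigma_{\neq}$ for the distinct-index term):

$$
\operatorname{Var}\!\Bigl(\sum_{i=1}^{n} x_i\Bigr)
= \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(x_i, x_j)
= \sum_{i} \operatorname{Var}(x_i) + \sum_{i \neq j} \operatorname{Cov}(x_i, x_j).
$$

If the summands are exchangeable this collapses to $n\,\sigma_{=} + n(n-1)\,\sigma_{\neq}$, with $\sigma_{=} = \operatorname{Var}(x_i)$ and $\sigma_{\neq} = \operatorname{Cov}(x_i, x_j)$ for $i \neq j$.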
At this point we need to make an assumption about the distribution in question. We will assume it is Dirichlet; consequently, the moments we need can be written in closed form.
On to the next term: using common properties of the Dirichlet, it too can be written in closed form.
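For reference, the “common properties of the Dirichlet” in play here are presumably its first and second moments. With $x \sim \operatorname{Dirichlet}(\alpha)$ and $\alpha_0 = \sum_k \alpha_k$ (notation assumed, not taken from the text):

$$
\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}, \qquad
\operatorname{Var}(x_k) = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^{2}(\alpha_0 + 1)}, \qquad
\operatorname{Cov}(x_j, x_k) = \frac{-\,\alpha_j \alpha_k}{\alpha_0^{2}(\alpha_0 + 1)} \quad (j \neq k).
$$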
Finally, notice that .