Currently there are two regularization penalties, which is something of a hack. Ideally, we would stick with a single penalty that is consistent across methods. This involves simulating additional non-links. In other words, we want to add an expected non-link term to the likelihood we wish to optimize, where the simulated point is drawn from a distribution of our choosing. Note that this becomes equivalent to taking an expectation under that distribution. Computing this expectation exactly is intractable, so we use the same old Taylor trick.
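As a concrete sketch of the idea (not the actual model here — the logistic form, the standard-normal simulating distribution, and all names are illustrative assumptions), this is what "add an expected non-link term to the likelihood" looks like when the expectation is estimated by Monte Carlo instead of a Taylor expansion:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def augmented_loglik(theta, X, y, n_sim=1000):
    """Bernoulli log-likelihood plus an expected non-link term,
    E_q[log(1 - sigmoid(theta . x))], estimated by Monte Carlo."""
    p = sigmoid(X @ theta)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))
    # q is a distribution of our choosing (standard normal, purely illustrative).
    x_sim = rng.standard_normal((n_sim, theta.shape[0]))
    penalty = np.mean(np.log1p(-sigmoid(x_sim @ theta)))
    return ll + penalty
```

The Taylor trick below replaces the Monte Carlo average with a closed-form approximation around the mean of `q`.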
This is very important: first-order is NOT enough. To see why, note that what a first-order approximation really says is that you can replace a set of points with their mean. If you did this in logistic regression for all the 0’s and all the 1’s, you would get a single point for the 0’s and a single point for the 1’s, i.e., complete separability, and therefore your estimates would diverge.
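The divergence under complete separability is easy to demonstrate: run plain gradient ascent on the logistic log-likelihood with one point per class and the coefficient grows without bound (a minimal sketch; the data and step size are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One point per class: collapsing each label's points to its mean
# yields perfectly separable data.
X = np.array([[-1.0], [1.0]])
y = np.array([0.0, 1.0])

theta = np.zeros(1)
history = []
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ theta))  # score of the logistic log-likelihood
    theta += 0.5 * grad                    # plain gradient ascent, no penalty
    history.append(theta[0])
# theta never converges: it keeps growing, roughly like the log of the step count.
```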
So what are the ingredients of a second-order approximation? We first need to compute the gradients of the partition function. The other thing we need is the variance of the simulated point. Note that since variance is shift invariant, we can just compute the variance of the shifted quantity.
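For reference, the Taylor trick for an expectation through second order is E[f(x)] ≈ f(μ) + ½ f″(μ) Var(x); the first-order version is just f(μ). A small numerical check (using a Bernoulli-style log partition function and a normal x as stand-ins — both are illustrative choices, not the model here) shows why the second-order term matters, and also illustrates the shift invariance of variance:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(z):
    return np.log1p(np.exp(z))  # softplus: the Bernoulli log partition function

mu, sigma = 0.5, 1.5
x = rng.normal(mu, sigma, size=200_000)

mc_truth = f(x).mean()          # Monte Carlo estimate of E[f(x)]
first_order = f(mu)             # "replace the points with their mean"
fpp = np.exp(mu) / (1.0 + np.exp(mu)) ** 2     # f''(mu) = sigmoid'(mu)
second_order = f(mu) + 0.5 * fpp * sigma**2    # Taylor through second order

# Variance is shift invariant, so centering first changes nothing:
assert np.isclose(x.var(), (x - mu).var())
```

The second-order estimate lands much closer to the Monte Carlo truth than the first-order one does.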
We can expand this term. Typically, the covariance takes two forms: one for when the two indices coincide and one for when they differ. For convenience, we denote the diagonal and off-diagonal values separately; the expansion can then be rewritten in terms of these two quantities.
At this point we need to make an assumption about the distribution of the simulated point. We will assume it follows a Dirichlet distribution. Consequently, the moments we need are available in closed form.
On to the next term. Using common properties of the Dirichlet, this can also be evaluated in closed form.
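The common properties in question are presumably the standard closed-form Dirichlet moments: for x ~ Dir(α) with α₀ = Σᵢ αᵢ, we have E[xᵢ] = αᵢ/α₀, Var(xᵢ) = αᵢ(α₀ − αᵢ)/(α₀²(α₀ + 1)), and Cov(xᵢ, xⱼ) = −αᵢαⱼ/(α₀²(α₀ + 1)) for i ≠ j — note the two forms, diagonal and off-diagonal. A quick numerical sanity check (the α values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([2.0, 3.0, 5.0])   # arbitrary illustrative parameters
a0 = alpha.sum()

# Closed-form Dirichlet moments.
mean = alpha / a0
var = alpha * (a0 - alpha) / (a0**2 * (a0 + 1.0))    # i == j (diagonal) form
cov01 = -alpha[0] * alpha[1] / (a0**2 * (a0 + 1.0))  # i != j (off-diagonal) form

# Empirical moments from samples.
samples = rng.dirichlet(alpha, size=500_000)
emp_mean = samples.mean(axis=0)
emp_cov = np.cov(samples, rowvar=False)
```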
Finally, notice that .