Currently there are two regularization penalties, and this is sort of a hack. Ideally, we’d want to stick with a single penalty that is consistent across methods. This involves simulating additional non-links. In other words, we want to add the expected contribution of these simulated non-links to the likelihood we wish to optimize, where each simulated non-link is drawn from a distribution of our choosing. Computing this expectation exactly is intractable, so we use the same old Taylor trick.
This is very important: first-order is NOT enough. To see why, note that a first-order approximation effectively replaces a set of points with their mean. If you did that in logistic regression for all the 0’s and all the 1’s, you’d be left with a single point for the 0’s and a single point for the 1’s, i.e., complete separability, and your estimates would therefore diverge.
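As a concrete sketch of the first- versus second-order distinction, suppose (this is an assumption, since the exact term is not written out above) that each simulated non-link contributes the logistic log partition function $A(\eta) = \log(1 + e^{\eta})$ evaluated at a random linear predictor $\eta$ with mean $\mu = \mathbb{E}[\eta]$. Then

$$
\mathbb{E}[A(\eta)] \;\approx\; A(\mu) \qquad \text{(first order)},
$$
$$
\mathbb{E}[A(\eta)] \;\approx\; A(\mu) + \tfrac{1}{2} A''(\mu)\operatorname{Var}(\eta)
\;=\; A(\mu) + \tfrac{1}{2}\,\sigma(\mu)\bigl(1 - \sigma(\mu)\bigr)\operatorname{Var}(\eta) \qquad \text{(second order)},
$$

where $\sigma$ is the logistic function. The first-order version depends on the simulated points only through their mean, which is exactly the replace-the-points-with-their-mean failure described above; the variance term is what the second-order correction buys us.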
So what are the ingredients of a second-order approximation? First, we need the gradients of the partition function, i.e., its first and second derivatives. Second, we need the variance of its (random) argument. Note that since variance is shift invariant, any constant offset can be dropped, so it suffices to compute the variance of the random part.
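Under the same logistic assumption as above, the needed derivatives are standard, and the shift-invariance fact is generic:

$$
A(\eta) = \log(1 + e^{\eta}), \qquad
A'(\eta) = \sigma(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}, \qquad
A''(\eta) = \sigma(\eta)\bigl(1 - \sigma(\eta)\bigr),
$$
$$
\operatorname{Var}(c + X) = \operatorname{Var}(X) \quad \text{for any constant } c.
$$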
We can expand this variance as a double sum of pairwise covariances. The covariance takes two forms: one for when the two indices are equal and one for when they differ. For convenience, give the first (the variance term) and the latter (the cross term) their own symbols; the expansion can then be rewritten in terms of just these two quantities.
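A sketch of that expansion, using hypothetical notation since the original symbols are not shown ($x_1, \dots, x_n$ for the summands, $\sigma_{=}$ for the equal-index term, $\sigma_{\neq}$ for the distinct-index term):

$$
\operatorname{Var}\!\Bigl(\sum_{i=1}^{n} x_i\Bigr)
= \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(x_i, x_j)
= \sum_{i} \operatorname{Var}(x_i) + \sum_{i \neq j} \operatorname{Cov}(x_i, x_j).
$$

If the summands are exchangeable this collapses to $n\,\sigma_{=} + n(n-1)\,\sigma_{\neq}$, with $\sigma_{=} = \operatorname{Var}(x_i)$ and $\sigma_{\neq} = \operatorname{Cov}(x_i, x_j)$ for $i \neq j$.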
At this point we need to make an assumption about the distribution in question. We will assume it is Dirichlet; consequently, the moments we need can be written in closed form.
On to the next term: using common properties of the Dirichlet, it too can be written in closed form.
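For reference, the “common properties of the Dirichlet” in play here are presumably its first and second moments. With $x \sim \operatorname{Dirichlet}(\alpha)$ and $\alpha_0 = \sum_k \alpha_k$ (notation assumed, not taken from the text):

$$
\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}, \qquad
\operatorname{Var}(x_k) = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^{2}(\alpha_0 + 1)}, \qquad
\operatorname{Cov}(x_j, x_k) = \frac{-\,\alpha_j \alpha_k}{\alpha_0^{2}(\alpha_0 + 1)} \quad (j \neq k).
$$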
Finally, notice that .