08 | December | 2008 | Please Scoop Me!

When training generative models, we usually optimize parameters to optimize joint likelihood. However, this is in no way (or is it? let me know if you know better) a guarantee that you’ll do better on many real-world benchmarks such as precision/recall. So I wondered, what would happen if you optimized for these quantities?

Let’s take a step back. In a lot of applications (especially information retrieval), what you want is some mechanism (such as a predictive model) of generating scores associated with each possible outcome. And then you measure the loss based on the score of the outcome you wanted versus the score on all other outcomes. For example, let $o_{i}$ denote the $i$ th outcome and $s(o_i)$ the score of the $i$ th outcome. Then our goal is to optimize $\mathscr{L} = \log s(o^*) - \log \sum_i s(o_i),$ where $o^*$ is the “good” outcome. For example, if we were searching for a needle in a haystack, $o^*$ would be the needle and we want to ensure that the score associated with the needle is much higher than the score associated with the hay.

In the context of the models I’ve been discussing, the “score” is just the probability of a link. The objective then takes the form (over all links of interest):

$(\sum_{ij \in \Lambda} \log p(y_{ij})) - |\Lambda| \log \sum_{ij} p(y_{ij}),$ where $\Lambda$ is the set of links of interest. When we use $\psi_e$ as the link probability function, we can evaluate this quantity analytically: $\sum_{ij \in \Lambda} \log(\pi^T (\phi_i \circ \phi_j)^+) - |\Lambda| \log \pi^T \sum_{ij} (\phi_i \circ \phi_j)^+.$ Here, $x^+ = \langle x, 1 - \sum x \rangle.$ The last sum is could be computationally difficult, except that we can rewrite it as $M (\phi_\sigma \circ \phi_\sigma)^+,$ where $M$ is the total number of possible links and $\phi_\sigma = \frac{1}{N}\sum_i \phi_i.$ The gradient of this whole thing is easy enough to compute and we can throw the whole thing into our favorite optimizer using the log barrier method to ensure that $\pi$ doesn’t run away.

So what happens you might wonder? Well, for the data I have it becomes unstable and the probability of linkage when two nodes are at all similar approaches 1 and when they are not it approaches 0. Thus, in practice the log probability becomes a large number minus a large number. The intuition behind why this happens is a little like the intuition behind separation in regression. If, on the balance, any covariate is predictive of the output, then it’s coefficient might as well get shoved to (negative) infinity. Another perspective is by looking at the objective in terms of one covariate (one that governs the current outcome of interest); then $\mathscr{L} = \log \frac{s(o^*)}{A + B s(o^*)},$ where we have put in the denominator $B$ other outcomes which have the same response function as the outcome we care about. No matter what we think about those other outcomes, the objective can be optimized by setting the score for all of these outcomes as high as possible.

Meandering theory aside, I wondered what would happen if I tried this to do actual prediction. The table below gives the results (the first two are from a previous table, the last one is the method I’m describing in this post).

Method	Predictive Log Link Likelihood
Maximum likelihood estimate	-11.6019
Arbitrarily set to 4,-3	-11.4354
Proposed method	-11.3065

Wow, epic win using this method. But before we get too excited, we should remember that setting the parameters this way has a discriminative flavor. In particular, the word likelihood, that other thing we care about, actually goes down. In contrast to the “proper” generative approach, here you need to set the parameters differently for each task you want to participate in.

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Daily Archives: December 8, 2008

Optimizing for precision/recall

Blog Stats

Archives

Meta