# Monthly Archives: December 2008

## Optimizing for precision/recall

When training generative models, we usually optimize parameters to optimize joint likelihood.  However, this is in no way (or is it? let me know if you know better) a guarantee that you’ll do better on many real-world benchmarks such as precision/recall.  So I wondered, what would happen if you optimized for these quantities?

Let’s take a step back.  In a lot of applications (especially information retrieval), what you want is some mechanism (such as a predictive model) of generating scores associated with each possible outcome.  And then you measure the loss based on the score of the outcome you wanted versus the score on all other outcomes.   For example, let $o_{i}$ denote the $i$th outcome and $s(o_i)$ the score of the $i$th outcome.   Then our goal is to optimize $\mathscr{L} = \log s(o^*) - \log \sum_i s(o_i),$ where $o^*$ is the “good” outcome.  For example, if we were searching for a needle in a haystack, $o^*$ would be the needle and we want to ensure that the score associated with the needle is much higher than the score associated with the hay.

In the context of the models I’ve been discussing, the “score” is just the probability of a link.  The objective then takes the form (over all links of interest):
$(\sum_{ij \in \Lambda} \log p(y_{ij})) - |\Lambda| \log \sum_{ij} p(y_{ij}),$ where $\Lambda$ is the set of links of interest.   When we use $\psi_e$ as the link probability function, we can evaluate this quantity analytically: $\sum_{ij \in \Lambda} \log(\pi^T (\phi_i \circ \phi_j)^+) - |\Lambda| \log \pi^T \sum_{ij} (\phi_i \circ \phi_j)^+.$  Here,  $x^+ = \langle x, 1 - \sum x \rangle.$   The last sum is could be computationally difficult, except that we can rewrite it as $M (\phi_\sigma \circ \phi_\sigma)^+,$ where $M$ is the total number of possible links and $\phi_\sigma = \frac{1}{N}\sum_i \phi_i.$  The gradient of this whole thing is easy enough to compute and we can throw the whole thing into our favorite optimizer using the log barrier method to ensure that $\pi$ doesn’t run away.

So what happens you might wonder?  Well, for the data I have it becomes unstable and the probability of linkage when two nodes are at all similar approaches 1 and when they are not it approaches 0.  Thus, in practice the log probability becomes a large number minus a large number.  The intuition behind why this happens is a little like the intuition behind separation in regression.  If, on the balance, any covariate is predictive of the output, then it’s coefficient might as well get shoved to (negative) infinity.   Another perspective is by looking at the objective in terms of one covariate (one that governs the current outcome of interest); then $\mathscr{L} = \log \frac{s(o^*)}{A + B s(o^*)},$ where we have put in the denominator $B$ other outcomes which have the same response function as the outcome we care about.  No matter what we think about those other outcomes, the objective can be optimized by setting the score for all of these outcomes as high as possible.

Meandering theory aside, I wondered what would happen if I tried this to do actual prediction.   The table below gives the results (the first two are from a previous table, the last one is the method I’m describing in this post).
 Method Predictive Log Link Likelihood Maximum likelihood estimate -11.6019 Arbitrarily set to 4,-3 -11.4354 Proposed method -11.3065

Wow, epic win using this method. But before we get too excited, we should remember that setting the parameters this way has a discriminative flavor. In particular, the word likelihood, that other thing we care about, actually goes down. In contrast to the “proper” generative approach, here you need to set the parameters differently for each task you want to participate in.

1 Comment

Filed under Uncategorized

## Even more predictive models

In the last post, I presented a comparison of different ways of doing prediction. A natural follow-up question is whether or not there are even better functions?  I observed in the last post that a straight line performs better than the logistic function which has a downward hump.  A natural set of functions to explore then has an upwards hump.   First let’s plot the two probability functions from the previous post along with several new ones (note the color scheme is different):

It’s worth mentioning that I’ve parameterized them in such a way so as to ensure that the endpoints all line up.  Let’s describe them and see how well they do:

 Function Label Predictive Likelihood $b - \exp(-c x + d)$ $\psi_b, b=1$ -11.4085 … $\psi_b, b=0.8$ -11.402 … $\psi_b, b=0.75$ -11.4003 $\pi_0 + (\pi_1 - \pi_0)\delta_{>r}(x)$ threshold -11.7929 $a x^2 -2 a x + c$ $\psi_q$ -11.3968 $u \exp(- \frac{(x - 1)^2}{v^2})$ $\psi_g$ -11.4704 $u + w * \exp(- \log(x)^2 /2)$ $\psi_l$ -11.3747

As it happens, we actually do better by making the hump protrude upwards (i.e., making the function concave.). $\psi_b, \psi_q$ and $\psi_l$ both perform better than the previously proposed functions; $\psi_q$ is easy since it has a simple quadratic form. In other words, it could be a second approximation of something, but of what?  $\psi_l$ does a little bit better than that. Maybe we’re looking a posterior normality of log likelihoods?  If I knew the answer to this question, I think I’d be able to get a grip on a function which is actually justifiably better…

Filed under Uncategorized

## Deriving and evaluating the second order approximation for links

In the previous post I argued that the second order approximation is useful for prediction.  Let’s apply that to a model with links and see what happens. The random variable over which we take the expectation is now $\bar{z}_1 \circ \bar{z}_2$ and the second order term is then $\frac{1}{2}\mathrm{Cov}(\bar{z}_1 \circ \bar{z}_2) H(\mathbb{E}[\bar{z}_1 \circ \bar{z}_2]),$ where $H$ is the Hessian matrix.   We address each of these terms in turn.

Let $x = \eta^T (\bar{z}_1 \circ \bar{z}_2) + \nu.$  Then $f = \sigma(x)$ and $\partial_{\bar{z}_{1k} \bar{z}_{2k}} f(x) = \eta_k \sigma'(x).$  Consequently, the entries of the Hessian are $H_{ij} = \eta_i \eta_j \sigma''(x).$

Now to compute the covariance.  Observe that $\mathbb{E}[\bar{z}_{1k} \bar{z}_{2k} \bar{z}_{1j} \bar{z}_{2j}] = \mathbb{E}[\bar{z}_{1k} \bar{z}_{1j}] \mathbb{E}[\bar{z}_{2k} \bar{z}_{2j}]$ by independence.  This can be rewritten as $C^1_{kj} C^2_{kj} + C^1_{kj} \mu^2_{kj} + C^2_{kj} \mu^1_{kj},$ where $C^i_{kj} = \mathrm{Cov}(\bar{z}_{ik}, \bar{z}_{ij})$ and $\mu^i_{kj} = \mathbb{E}[\bar{z}_{ik}]\mathbb{E}[\bar{z}_{ij}].$

I implemented this new approximation and here are the results.

 Inference method Parameter estimation Prediction method Log likelihood (Higher is better) $\psi_\sigma$ Regularized optimization $\psi_\sigma$ first order -11.755 $\psi_e$ Regularized optimization $\psi_e$ -11.6019 $\psi_\sigma$ Fixed at 4,-3 $\psi_\sigma$ first order -11.5183 $\psi_e$ Fixed at 4,-3 $\psi_e$ -11.4354 $\psi_\sigma$ Fixed at 4,-3 $\psi_\sigma$ second order -11.5192 $\psi_\sigma$ Fixed at 4,-3 $\psi_e$ -11.4352

Suckitude.  The only thing that really matters is

• Fix parameters.
• Use $\psi_e$ predictive model.

The rest doesn’t matter.  Why?  It might help to look at the plots of the link probability functions.  These are parameterized by $\eta$ and $\nu.$  We use a simplified model here where $\eta$ is assumed to be the same across all components.  In that case, the response is entirely a function of scalar, $\vec{1}^T (\bar{\phi}_1 \bar{\phi}_2).$  I plot the values of the response against different values of this scalar.

Comparison of different link probability functions

Comparison of different link prediction functions (log)

The endpoints are the same but there’s a dip in the middle for $\psi_\sigma.$  I suppose that although this function seems more “correct,” and is better able to lower the joint likelihood, the dip means that it is less aggressive in proposing positive links and is therefore worse at precision/recall type measures.

So why not directly optimize the objective we care about?

Filed under Uncategorized

## Further approximation improvements

I was a little bit sneaky in my previous post and the reason I say this will become apparent when I change the function being approximated.  Let’s break it down like this: there are three things we want to compare $\log \mathbb{E}[f(x)] (1),$ $\log f(\mathbb{E}[x]) (2),$ and $\mathbb{E}[\log f(x)] (3).$ And just to add a little more notation, let $f_\pi(\bar{z}) = \pi_1 \bar{z} + \pi_0 (1 - \bar{z})$ and $f_\sigma(\bar{z}) = \sigma(\eta \bar{z} + nu).$

In the previous post, I focused on $f_\pi$ and I was comparing (3) against both (1) and (2) by virtue of (1) and (2) being equal for this choice of f.  I lamented that the approximation wasn’t so good.  We can try to ameliorate the situation by adding second order terms.  Let’s focus on going from (3) to (2).  (2) is generally the easiest to compute as you only have to evaluate f at one point.  (3) often occurs in the objective function of variational approximations.  Now the second order term here is $\frac{1}{2}\mathrm{Var}(x)(\frac{f''(\mathbb{E}[x])}{f(\mathbb{E}[x])} - (\frac{f'(\mathbb{E}[x])}{f(\mathbb{E}[x])})^2).$   (Note that this gives you the result I posted a fe posts back regarding the 2nd order approximation of $\latex f_\pi$) Adding this term in leads to the following results (same methodology as before):

2nd order approximation of $f_\pi$

2nd order approximation of $f_\sigma$

Note that here we do much better than before.  Except for a few errant cases we’re actually down almost an order of magnitude.  We should be happy that we’ve got the problem licked.  Ok, on to the next problem.  Suppose we want to compare (1) and (2).  Note that these are identical for $f_\pi$ by linearity.  But for $f_\sigma$ they are not.  Again, (1) is easy to compute but (2) is usually what we want for a predictive probability.    The second order term here is simpler since we can disregard the log (as they appear on the outer most layer of both (1) and (2)): $\frac{1}{2}\mathrm{Var}(x)f''(\mathbb{E}[x]).$   Let’s run the same experiments again, except this time comparing (1) and (2) for $f_\sigma$ only.

Error using a 1st order predictive approximation

Error using a 2nd order predictive approximation

With a first order approximation, we get some pretty significant errors.  Using the second order approximation greatly reduces those errors (by an order of magnitude).  It seems like the 2nd order approximation is a good idea for this application.  Now the question is, will more accuracy here lead to a better performing model on real data?

Filed under Uncategorized

In the previous post, I posed two questions.   I’ll answer the second first.

This question considers what would happen if the response function (any response function) were to depend only on a single latent variable.  To use the notation of the previous post, I’d write $p(y = 1 | \vec{z}, \lambda) = \phi(z_\lambda).$  Here I will be a little more general and allow $z$ to be multinomial rather than binomial.  The model presented in the previous post can then just be written as $p(y = 1 | \vec{z}, \lambda) = \pi^T z_\lambda,$ where with a slight abuse of notation I’m letting $z_\lambda$ denote an indicator vector.

So it turns out that we can just reduce any choice of $\phi$ to this construction by rewriting $p(y = 1 | \vec{z}, \lambda) = \sum_k \phi(k) \delta_k(z_\lambda) = \vec{\phi}^T z_\lambda.$  For example, in the model presented several posts earlier, $\phi(z) = \sigma(\eta^T z + \nu),$ which means that we can represent this in the more general parameterization as $\pi=\sigma(\eta + \nu).$  Thus the answer to the question is a non-starter; the previous post already answered it.

The other question asks what would happen if we took the previous post’s model, but made the response a function of $\bar{z}$ instead of $z_\lambda.$  In other words, $p(y = 1 | \vec{z}) = \pi^T \bar{z}.$   This looks pretty similar (in expectation) to the previous model except that $\bar{z}$ is part of the model now rather than just a result of an expectation.  So, if we want to compute $\mathbb{E}_z[p(y = 1| \vec{z})],$ it works out to be the same.  But if we want to compute $\mathbb{E}_z[\log p(y = 1 | \vec{z})],$ it’s different.  In terms of the original formulation, we are moving the logarithm across the expectation with respect to $z,$ but NOT across the expectation with respect to $\lambda.$

Moving it as such, it does not have a simple closed form solution, so I can’t provide an analytic solution to the error incurred by moving the log inside the expectation.  However, I can empirically estimate this by sampling from $\bar{z}$ where each $z_i$ is drawn according to a binomial distribution parameterized by $\theta.$  I then compute $p(y = 1)$ using this $\bar{z}$ and take the mean over all samples (in this case, 100 samples were used for every point).  I compare this value to the value obtained by computing $p(y=1)$ at the mean over these same samples.   This error is what I plot in the attached figure as a function of $\theta.$

I produce three series for different values of $N,$ the number of draws from the binomial distribution used to create $\bar{z}.$   I chose values of $\pi_1 = \sigma(4), \pi_0 = \sigma(-3).$  This corresponds to the red line in the previous post.  One thing that is evident is that the error is much smaller.  You can think of the previous post’s curve as the $N=1$ case.  As $N$ increases, the covariance on $\bar{z}$ decreases and the error drops.  This is rather comforting.  However, for reasonable values of $N,$ the errors are still rather large: 0.10 may not seem like a lot in log likelihood, but that’s a lot larger that the difference between most techniques!