Applying the Ising model to another data set

Audioscrobbler is this really cool data set from a few years ago; back then, Audioscrobbler had not yet been rolled into last.fm, but it had about the same functionality as the site does now. Basically, it’s a little plugin for iTunes et al. that keeps track of all the artists you listen to. The listening habits of several thousand people were collected and distributed under a Creative Commons license.

After some normalization/cleanup, we end up with a set of artists each user is liable to listen to.
This is the sort of co-occurrence statistic which Ising models are good at capturing. The Ising model contains a matrix of parameters which indicate the correlations between artists — that is, the relative likelihood that a given user will end up listening to both artists.

Because this is a rather high-dimensional problem, we can employ some L1 + L2 penalization; what we end up learning is a relatively sparse parameter matrix that is often easier to interpret.
With some magic (cough cough) we can learn this parameter matrix fairly quickly. I thought I’d post some of the correlations between artists here for your {be/a}musement.
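For concreteness, here’s a minimal sketch of what that learning step could look like: node-wise logistic regressions (a pseudolikelihood approximation to the Ising model) with an L1 + L2 penalty, optimized by proximal gradient. The function name, penalty weights, and step size are all invented for illustration; this is not the “magic” actually used here.

```python
import numpy as np

def fit_ising_row(X, i, l1=0.05, l2=0.01, lr=0.1, iters=500):
    """Estimate row i of the Ising coupling matrix by elastic-net
    penalized logistic (pseudolikelihood) regression of column i of
    the binary (users x artists) matrix X on the remaining columns.
    A toy sketch, not the optimizer used in the post."""
    y = 2.0 * X[:, i] - 1.0            # labels in {-1, +1}
    Z = np.delete(X, i, axis=1)        # all other artists
    w = np.zeros(Z.shape[1])
    for _ in range(iters):
        margin = y * (Z @ w)
        # gradient of mean logistic loss, plus the smooth L2 term
        grad = -(Z * (y / (1.0 + np.exp(margin)))[:, None]).mean(0)
        w -= lr * (grad + l2 * w)
        # soft-thresholding: proximal step for the L1 term
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w
```

Running this once per artist gives one row of the coupling matrix; symmetrizing the result afterwards is a common post-processing step.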

Now the actual parameter matrix consists of several thousand artists. Here, I’m selecting the 10 artists with the highest total correlations. You might say that these are the artists which tug most fiercely on other artists (the most cliquey artists if you want). For each of these 10 artists, I show the 5 most highly correlated artists.
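That selection step is easy to sketch, assuming we already have the learned artist-by-artist parameter matrix W (the helper name and arguments below are mine, not from the post):

```python
import numpy as np

def top_cliques(W, names, n_top=10, n_neighbors=5):
    """Given a symmetric artist-artist parameter matrix W, report the
    n_top artists with the largest total (absolute) correlation, and
    each one's n_neighbors most strongly correlated artists."""
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 0.0)                   # ignore self-correlation
    strength = np.abs(W).sum(axis=1)           # how hard each artist "tugs"
    leaders = np.argsort(strength)[::-1][:n_top]
    return {names[i]: [names[j] for j in np.argsort(W[i])[::-1][:n_neighbors]]
            for i in leaders}
```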

The results make pretty good sense; it’s actually kind of disturbing how predictable people’s musical tastes are. And for some reason the main cliques at the top of the list are all either metal bands or the sort of indie bands likely to populate OC soundtracks =). I should point out that if you go further down the list you eventually find a few other cliques such as trip hop (Portishead, Massive Attack, Lamb, Tricky, et al. [note to self: how cool would “et al.” be as a band name?]), 80s rock with remarkable staying power (Aerosmith, Bon Jovi, Guns N’ Roses), wuss rock (Counting Crows, DMB, Goo Goo Dolls), and just plain bad music (3DD, Hoobastank, Staind, Nickelback).

Artist … is correlated with
Metallica: Iron Maiden, Megadeth, Pantera, Slayer, Nightwish
In Flames: Dark Tranquillity, Soilwork, Children of Bodom, Arch Enemy, Dimmu Borgir
The Arcade Fire: The Fiery Furnaces, Broken Social Scene, The Go! Team, Bloc Party, Stars
Nightwish: Within Temptation, Sonata Arctica, Blind Guardian, Stratovarius, Therion
Rammstein: Nightwish, Apocalyptica, KoЯn, Marilyn Manson, Metallica
Belle and Sebastian: The Magnetic Fields, Neutral Milk Hotel, Yo La Tengo, Elliott Smith, Camera Obscura
Iron Maiden: Judas Priest, Iced Earth, Helloween, Manowar, Bruce Dickinson
Elliott Smith: Iron & Wine, The Decemberists, Bright Eyes, Sufjan Stevens, Belle and Sebastian
Bright Eyes: Rilo Kiley, Death Cab for Cutie, Desaparecidos, Cursive, The Good Life
Death Cab for Cutie: The Postal Service, Bright Eyes, The Shins, Rilo Kiley, Cursive

Generating holdout sets for network data

What would we do without the train/test paradigm?  It gives us a convenient way of saying “Aha, I’m better than X et al. by 5%!”  But rarely do we actually get data which has been partitioned into train/test by a higher power; instead we get a bunch of labeled data which we have to split in twain (or variations thereof if we wanted to be more sophisticated and do x-validation, bootstrapping, etc.).

Now this is usually not too big an issue.  We just uniformly randomly select elements from the labeled data to construct our training set.  This gives us training and test sets with approximately the same distributional characteristics and all is well with the world.

This becomes considerably more challenging (or interesting, depending on your point of view) when we have network data.  The whole point of modeling data along with the relationships between them is to remove the typical i.i.d. assumptions.  This means that drawing i.i.d. samples from the set of documents to construct the training/test sets disregards the very thing you are trying to model (and which you therefore believe to be important).   

How you actually want to construct your train/test split will probably depend in part on your application and the constraints you have on how data is obtained.  But we’d like to believe our algorithms do (relatively) as well no matter how the train/test split is done.  Because of that (and limited resources) we tend to evaluate our models only on one such splitting scheme.  In this post, I want to see if this actually matters.

More precisely, I want to consider what happens when you want to predict links.  Because of sampling or what have you, your model is only able to see a subset of the links in the network (the train set).  The model’s goal is to infer from these seen links the remaining hidden links (the test set).  

Here’s my experimental setup.  I’m going to imagine that I’m writing a paper and I want to compare two models.  One is a naive model (NM) and the other is a really naive model (RNM).  The specifics of these models aren’t too important (and looking at the results there’s a good chance you’ll be able to guess what they are).  And I’m going to try doing the evaluation with two different train/test splitting methods.  The first uses a “Bernoulli” hold-out method.  That is, each edge is allocated independently to the test set with probability p; the edges not allocated to the test set are allocated to the train set.  The second hold-out method is “survey.”  In this method, I go to each node and randomly select K of the edges which have an endpoint at that node (or all of them if it has fewer than K neighbors).
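Here’s roughly what the two schemes look like in code. This is a sketch under my reading of the setup: the Bernoulli draw sends each edge to the test set with probability p, and for “survey” I assume the K sampled edges per node form the train set (the post leaves that direction implicit).

```python
import numpy as np

def bernoulli_split(edges, p=0.86, rng=None):
    """Allocate each edge independently to the test set with prob. p."""
    rng = rng or np.random.default_rng()
    test_mask = rng.random(len(edges)) < p
    train = [e for e, m in zip(edges, test_mask) if not m]
    test = [e for e, m in zip(edges, test_mask) if m]
    return train, test

def survey_split(edges, n_nodes, K=3, rng=None):
    """At each node, sample (up to) K incident edges; the surveyed
    edges form the train set, everything else the test set
    (an assumption -- the post doesn't pin this down)."""
    rng = rng or np.random.default_rng()
    incident = {v: [] for v in range(n_nodes)}
    for u, v in edges:
        incident[u].append((u, v))
        incident[v].append((u, v))
    kept = set()
    for v in range(n_nodes):
        es = incident[v]
        pick = rng.choice(len(es), size=min(K, len(es)), replace=False)
        kept.update(es[i] for i in pick)
    train = [e for e in edges if e in kept]
    test = [e for e in edges if e not in kept]
    return train, test
```

Note that under the survey scheme every node is guaranteed some observed edges, whereas the Bernoulli scheme can leave a node with no training edges at all; that asymmetry is exactly what makes the comparison below interesting.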

So now we have two (plausible) ways of constructing train/test splits, but ones which look potentially very different.  I’m going to “evaluate” my two models by computing the mean log likelihood of the held-out links for 10 graphs, with held-out sets drawn using the aforementioned sampling schemes.  For these experiments, I chose K = 3 and p = .86 so that the number of edges in the train/test split is approximately the same between the two methods.

Results are presented below (standard deviations are in red).  

[Figure: Predictive power of two models on two holdout methods]

RNM performs the same no matter how you select the hold-out graphs (and the standard deviations are zero, which should give you a pretty good idea of what RNM is).  NM, as we’d expect, performs (significantly) better no matter what sampling method is used.  However, NM under survey sampling performs considerably better than NM under Bernoulli sampling.  In fact, the difference between NM and RNM on Bernoulli is paltry compared to the difference on survey.

What this means is that the value proposition between RNM and NM is very different depending on how your data is being collected.  And the conclusions we draw about models on one sampling method might not be easily applied to another.

Optimizing for precision/recall

When training generative models, we usually fit parameters to maximize joint likelihood.  However, this is in no way (or is it? let me know if you know better) a guarantee that you’ll do better on many real-world benchmarks such as precision/recall.  So I wondered, what would happen if you optimized for these quantities directly?

Let’s take a step back.  In a lot of applications (especially information retrieval), what you want is some mechanism (such as a predictive model) of generating scores associated with each possible outcome.  And then you measure the loss based on the score of the outcome you wanted versus the score on all other outcomes.   For example, let o_{i} denote the ith outcome and s(o_i) the score of the ith outcome.   Then our goal is to optimize \mathscr{L} = \log s(o^*) - \log \sum_i s(o_i), where o^* is the “good” outcome.  For example, if we were searching for a needle in a haystack, o^* would be the needle and we want to ensure that the score associated with the needle is much higher than the score associated with the hay.
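A toy numeric instance of this objective (the scores below are completely invented):

```python
import math

# One "needle" outcome against three pieces of hay.
scores = {"needle": 8.0, "hay1": 1.0, "hay2": 0.5, "hay3": 0.5}

# L = log s(o*) - log sum_i s(o_i)
objective = math.log(scores["needle"]) - math.log(sum(scores.values()))
# Raising the needle's score (or lowering the hay's) pushes the
# objective toward its maximum of 0.
```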

In the context of the models I’ve been discussing, the “score” is just the probability of a link.  The objective then takes the form (over all links of interest):
(\sum_{ij \in \Lambda} \log p(y_{ij})) - |\Lambda| \log \sum_{ij} p(y_{ij}), where \Lambda is the set of links of interest.   When we use \psi_e as the link probability function, we can evaluate this quantity analytically: \sum_{ij \in \Lambda} \log(\pi^T (\phi_i \circ \phi_j)^+) - |\Lambda| \log \pi^T \sum_{ij} (\phi_i \circ \phi_j)^+.  Here,  x^+ = \langle x, 1 - \sum x \rangle.   The last sum could be computationally difficult, except that we can rewrite it as M (\phi_\sigma \circ \phi_\sigma)^+, where M is the total number of possible links and \phi_\sigma = \frac{1}{N}\sum_i \phi_i.  The gradient of this whole thing is easy enough to compute, and we can throw the whole thing into our favorite optimizer, using the log barrier method to ensure that \pi doesn’t run away.
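That rewriting of the last sum is easy to sanity-check numerically. A quick sketch, where the \phi_i are arbitrary points on the simplex and the sum runs over all M = N^2 ordered pairs, self-pairs included (an assumption about what counts as a “possible link”):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 40, 5
phi = rng.dirichlet(np.ones(K), size=N)   # N points on the K-simplex

def plus(x):
    """x^+ = <x, 1 - sum(x)>, applied along the last axis."""
    return np.concatenate([x, 1.0 - x.sum(-1, keepdims=True)], axis=-1)

# Brute force: sum (phi_i o phi_j)^+ over all ordered pairs (i, j).
brute = plus(phi[:, None, :] * phi[None, :, :]).sum(axis=(0, 1))

# Closed form: M (phi_sigma o phi_sigma)^+ with M = N^2 pairs.
phi_sigma = phi.mean(axis=0)
closed = N * N * plus(phi_sigma * phi_sigma)

assert np.allclose(brute, closed)
```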

So what happens, you might wonder?  Well, for the data I have it becomes unstable: the probability of linkage approaches 1 when two nodes are at all similar and approaches 0 when they are not.  Thus, in practice the log probability becomes a large number minus a large number.  The intuition behind why this happens is a little like the intuition behind separation in regression.  If, on balance, any covariate is predictive of the output, then its coefficient might as well get shoved to (negative) infinity.   Another perspective comes from looking at the objective in terms of one covariate (one that governs the current outcome of interest); then \mathscr{L} = \log \frac{s(o^*)}{A + B s(o^*)}, where we have put in the denominator B other outcomes which have the same response function as the outcome we care about.  No matter what we think about those other outcomes, the objective can be optimized by setting the score for all of these outcomes as high as possible.

Meandering theory aside, I wondered what would happen if I tried this on actual prediction.   The table below gives the results (the first two rows are from a previous table; the last is the method I’m describing in this post).
Method | Predictive log link likelihood
Maximum likelihood estimate | -11.6019
Arbitrarily set to 4, -3 | -11.4354
Proposed method | -11.3065

Wow, epic win using this method. But before we get too excited, we should remember that setting the parameters this way has a discriminative flavor. In particular, the word likelihood, that other thing we care about, actually goes down. In contrast to the “proper” generative approach, here you need to set the parameters differently for each task you want to participate in.

Even more predictive models

In the last post, I presented a comparison of different ways of doing prediction. A natural follow-up question is whether there are even better functions.  I observed in the last post that a straight line performs better than the logistic function, which has a downward hump.  A natural set of functions to explore, then, has an upward hump.   First let’s plot the two probability functions from the previous post along with several new ones (note the color scheme is different):

[Figure: Even more link probability functions]

It’s worth mentioning that I’ve parameterized them in such a way so as to ensure that the endpoints all line up.  Let’s describe them and see how well they do:

Function | Label | Predictive likelihood
b - \exp(-c x + d) | \psi_b, b=1 | -11.4085
b - \exp(-c x + d) | \psi_b, b=0.8 | -11.402
b - \exp(-c x + d) | \psi_b, b=0.75 | -11.4003
\pi_0 + (\pi_1 - \pi_0)\delta_{>r}(x) | threshold | -11.7929
a x^2 - 2 a x + c | \psi_q | -11.3968
u \exp(-\frac{(x - 1)^2}{v^2}) | \psi_g | -11.4704
u + w \exp(-\log(x)^2 / 2) | \psi_l | -11.3747

As it happens, we actually do better by making the hump protrude upwards (i.e., making the function concave). \psi_b, \psi_q and \psi_l all perform better than the previously proposed functions; \psi_q is easy since it has a simple quadratic form. In other words, it could be a second-order approximation of something, but of what?  \psi_l does a little bit better than that. Maybe we’re looking at posterior normality of log likelihoods?  If I knew the answer to this question, I think I’d be able to get a grip on a function which is actually justifiably better…

Deriving and evaluating the second order approximation for links

In the previous post I argued that the second order approximation is useful for prediction.  Let’s apply that to a model with links and see what happens. The random variable over which we take the expectation is now \bar{z}_1 \circ \bar{z}_2 and the second order term is then \frac{1}{2}\mathrm{Cov}(\bar{z}_1 \circ \bar{z}_2) H(\mathbb{E}[\bar{z}_1 \circ \bar{z}_2]), where H is the Hessian matrix.   We address each of these terms in turn.  

Let x = \eta^T (\bar{z}_1 \circ \bar{z}_2) + \nu.  Then f = \sigma(x) and \partial_{\bar{z}_{1k} \bar{z}_{2k}} f(x) = \eta_k \sigma'(x).  Consequently, the entries of the Hessian are H_{ij} = \eta_i \eta_j \sigma''(x).

Now to compute the covariance.  Observe that \mathbb{E}[\bar{z}_{1k} \bar{z}_{2k} \bar{z}_{1j} \bar{z}_{2j}] = \mathbb{E}[\bar{z}_{1k} \bar{z}_{1j}] \mathbb{E}[\bar{z}_{2k} \bar{z}_{2j}] by independence.  Subtracting the product of means \mu^1_{kj} \mu^2_{kj}, the covariance entries become C^1_{kj} C^2_{kj} + C^1_{kj} \mu^2_{kj} + C^2_{kj} \mu^1_{kj}, where C^i_{kj} = \mathrm{Cov}(\bar{z}_{ik}, \bar{z}_{ij}) and \mu^i_{kj} = \mathbb{E}[\bar{z}_{ik}]\mathbb{E}[\bar{z}_{ij}].
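That covariance identity can be checked numerically with toy discrete distributions standing in for \bar{z}_1 and \bar{z}_2 (the supports and probabilities below are arbitrary):

```python
import numpy as np

# Two independent discrete random vectors on the simplex.
vals1 = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]); p1 = np.array([0.4, 0.6])
vals2 = np.array([[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]); p2 = np.array([0.5, 0.5])

def moments(vals, p):
    """Mean vector and covariance matrix of a discrete random vector."""
    mu = p @ vals
    centered = vals - mu
    return mu, centered.T @ (centered * p[:, None])

mu1, C1 = moments(vals1, p1)
mu2, C2 = moments(vals2, p2)

# Closed form: Cov(z1 o z2)_{kj} = C1 C2 + C1 mu2_k mu2_j + C2 mu1_k mu1_j
closed = C1 * C2 + C1 * np.outer(mu2, mu2) + C2 * np.outer(mu1, mu1)

# Exact covariance of z1 o z2 by enumerating the joint support.
w = np.array([vals1[a] * vals2[b] for a in range(2) for b in range(2)])
pw = np.array([p1[a] * p2[b] for a in range(2) for b in range(2)])
muw = pw @ w
exact = (w - muw).T @ ((w - muw) * pw[:, None])

assert np.allclose(closed, exact)
```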

I implemented this new approximation and here are the results.  

Inference method | Parameter estimation | Prediction method | Log likelihood (higher is better)
\psi_\sigma | Regularized optimization | \psi_\sigma first order | -11.755
\psi_e | Regularized optimization | \psi_e | -11.6019
\psi_\sigma | Fixed at 4, -3 | \psi_\sigma first order | -11.5183
\psi_e | Fixed at 4, -3 | \psi_e | -11.4354
\psi_\sigma | Fixed at 4, -3 | \psi_\sigma second order | -11.5192
\psi_\sigma | Fixed at 4, -3 | \psi_e | -11.4352

Suckitude.  The only thing that really matters is

  • Fix parameters.
  • Use \psi_e predictive model.

The rest doesn’t matter.  Why?  It might help to look at the plots of the link probability functions.  These are parameterized by \eta and \nu.  We use a simplified model here where \eta is assumed to be the same across all components.  In that case, the response is entirely a function of the scalar \vec{1}^T (\bar{\phi}_1 \circ \bar{\phi}_2).  I plot the values of the response against different values of this scalar.

[Figure: Comparison of different link probability functions]

[Figure: Comparison of different link prediction functions (log)]

The endpoints are the same, but there’s a dip in the middle for \psi_\sigma.  I suppose that although this function seems more “correct,” and is better able to improve the joint likelihood, the dip means that it is less aggressive in proposing positive links and is therefore worse at precision/recall-type measures.

So why not directly optimize the objective we care about?
