Monthly Archives: February 2009

Labeling topics

We all love topic models, and one of the things we like about them is that we get these lovely, interesting, semantically-coherent topics. But one thing limiting the use of these topics is that we have no easy appellations for them. Wouldn’t it be great if we could say that this document is 20% “astronomy” rather than 20% the topic whose top words are “star planet quasar galaxy”?

To this end, I ran another Mechanical Turk experiment. I presented the top 6 words from a topic (shuffled into a different order for different turkers, for good measure). The turkers were asked whether or not the set of words makes sense together and, if it did, to come up with a concise subject heading for it.

The topics I used were some of the topics learned by Nubbi on Wikipedia. I’ve put the responses here. The “Good topic %” is the proportion of people who judged that the set of words makes sense together. For some of the topics nearly everyone agrees that they’re good topics; for others almost everyone agrees that they’re terrible. More worrisome is that there are a few in the middle.
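The “Good topic %” column is just the share of affirmative votes each topic received. A minimal sketch of that aggregation in Python (the `(topic, is_good)` input format is my invention for illustration, not the actual HIT output):

```python
from collections import defaultdict

def good_topic_pct(judgments):
    """Compute, per topic, the fraction of turkers who judged its
    word set coherent. `judgments` is a list of (topic_id, is_good)
    pairs, one per turker response (hypothetical format)."""
    votes = defaultdict(list)
    for topic_id, is_good in judgments:
        votes[topic_id].append(is_good)
    return {t: sum(v) / len(v) for t, v in votes.items()}

responses = [
    ("star planet galaxy", True),
    ("star planet galaxy", True),
    ("star planet galaxy", False),
    ("the of and", False),
]
print(good_topic_pct(responses))
```

Topics near 0% or 100% are easy calls; it is the ones in the middle that are worrisome.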

As for the labels, there was considerable variation. Some of it is accounted for by different levels of specificity (“Monarchy” vs. “European history” vs. “English history” vs. “Mary and Bothwell”). But while the labels are in the same milieu, there doesn’t seem to be any universal agreement on what makes a good topic label.

So the takeaways are:

  1. Not as many topics are “good” as I would’ve thought.
  2. Coming up with good labels for topics is no easy task.

Let’s chew on that next time we stick a corpus through the LDA wringer.


Learning translations using large-scale Ising models

In a previous post, I opined about how Ising models could be applied to learning relationships between musical artists based solely on their co-occurrence patterns. So I got to thinking: what if the “artists” were actually words in two languages? By looking at the connections between words in language 1 and words in language 2, we can automatically learn translations between the two languages.

To do this, we need a corpus where sentences between the two languages are aligned, so that we can get good co-occurrence statistics. Luckily, Europarl is a great resource for this sort of thing. From this point on, we’ll be talking about the French-English Europarl.

Next we need to make a decision about what sort of co-occurrence statistics to include. There are two roads here. In one case, we include all the co-occurrences of the concatenated English-French sentence. So “delegation” might co-occur with both “foreign” (because they tend to occur together in English sentences) and “délégation” (because it occurs in the corresponding French sentences). In the second case, we keep only co-occurrences between Language 1 and Language 2, and none of the co-occurrences between Language 1 and Language 1 or Language 2 and Language 2. In the example above, this would mean that we would NOT consider co-occurrences of “delegation” and “foreign”.
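For concreteness, here is a toy Python sketch of the two routes over a single aligned sentence pair (whitespace tokenization and the function name are my simplifications, not the actual pipeline):

```python
from collections import Counter
from itertools import product

def cooccurrences(en_sent, fr_sent, cross_only=True):
    """Count word-pair co-occurrences for one aligned sentence pair.

    Route 2 (cross_only=True) keeps only English-French pairs;
    Route 1 (cross_only=False) treats the concatenated sentence as
    one bag of words and also counts intra-language pairs."""
    en, fr = set(en_sent.split()), set(fr_sent.split())
    counts = Counter()
    # cross-language pairs (kept under both routes)
    for e, f in product(en, fr):
        counts[(e, f)] += 1
    if not cross_only:
        # intra-language pairs (route 1 only), stored in sorted order
        for words in (en, fr):
            ws = sorted(words)
            for i in range(len(ws)):
                for j in range(i + 1, len(ws)):
                    counts[(ws[i], ws[j])] += 1
    return counts

c = cooccurrences("the foreign delegation", "la délégation étrangère")
# ("delegation", "délégation") is counted; under cross_only=True,
# ("delegation", "foreign") is not.
```

Summing these counters over the whole corpus gives the sufficient statistics the Ising model is fit to.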

So when I was deciding which of these to do, I thought that route 1 might be better. With more information, the model might be more capable of teasing out the fact that “delegation” does not translate to “étrangère” — in fact “delegation” co-occurs with “foreign” and so we can explain away the co-occurrences with “étrangère.”

How did it do? Here are dictionaries for route 1 and route 2. (Note that the sets of translated words differ slightly because I automatically prune the entries we’re less sure about.) These dictionaries show the top French translation for each English word (i.e., the French word with the highest correlation parameter in the Ising model). There’s lots of interesting stuff in the 2nd, 3rd, etc. best translations too (for example, while the top translation for “reply” is “réponse,” the next one on the list is “répondre”), but for economy of space I’m only putting up the top one.
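Extracting the dictionary then amounts to an argmax over the learned parameters, plus the pruning step. A hedged sketch, where the nested-dict format and the 0.1 threshold are illustrative assumptions, not the actual values:

```python
def top_translation(corr, english_word, min_weight=0.1):
    """Return the French word with the highest Ising correlation
    parameter for `english_word`, or None when no link is strong
    enough (the pruning of low-confidence entries). The threshold
    and the dict-of-dicts format are illustrative assumptions."""
    candidates = corr.get(english_word, {})
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= min_weight else None

corr = {"reply": {"réponse": 0.9, "répondre": 0.4, "chaise": 0.01}}
print(top_translation(corr, "reply"))  # réponse
```

Sorting the candidates instead of taking the max would recover the 2nd- and 3rd-best translations mentioned above.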

I’ve also automatically pruned a lot of the stop words. Those get translated very poorly, as do extremely polysemous words, so you’re not liable to get a good translation for a word like “that.” Multi-word idioms also do poorly (“once” -> “fois” rather than “une fois”; “course” -> “sûr” rather than “of course” -> “bien sûr”). Also problematic are words with no single-word translation, such as English modal verbs, which are often captured by inflection in French (“will” -> “sera” instead of “will be” -> “sera”; “would” -> “voudrais” instead of “would like” -> “voudrais” [which incidentally tells you a lot about how members of the European Parliament use the word “would”]).

But for specific and somewhat rare words, it does amazingly well (“talk” -> “parler”, “rights” -> “droits”, “stress” -> “souligner”). I don’t have exact numbers (perhaps I can make AMT do this for me), but it seems to do pretty well and I think ironically that route 2 does better. I’m not entirely sure why but I think it’s because without the intra-language co-occurrences the model can focus on connections that explain inter-language co-occurrences.

In summary, the approach is fast, easy, and works pretty well (considering that the only NLP tool I used was a simple tokenizer; no morphological analysis or stemming was done!).


What’s in a name?

I recently browsed my way over here and found it eminently amusing.

New meme: here’s a totally random way to make your new random band’s new random album cover. Post one! Go to “Wikipedia.” Hit “random” and the first article you get is the name of your band. Then go to “Random Quotations” and the last four or five words of the very last quote of the page is the title of your first album. Then, go to Flickr and click on “Explore the Last Seven Days” and the third picture, no matter what it is, will be your album cover.

I tried it a few times and found that it worked amazingly well. But what was more fascinating to me were the kinds of backstories people would make up based solely on a series of randomly chosen phrases. In particular, people seemed to be assigning genres to bands whose names were arbitrarily scraped from Wikipedia. Is there some latent logic behind these assignations? Would people generally agree on genres for these made-up bands?

In the old days, I would furrow my brow and call it a day. Nowadays, we have Mechanical Turk. Here’s the experiment: I randomly chose article titles from Wikipedia (with a little automatic processing to get rid of excessively long article titles, disambiguation pages, etc.). I offered each random article title as a band name to three turkers and asked them to guess the genre. In sum, I had genre guesses for 200 bands, each one done in triplicate.
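The title filtering was simple. A sketch of what such a filter might look like (the exact rules, including the four-word cutoff and the “List of” check, are my guesses at the preprocessing, not the actual code):

```python
def keep_title(title, max_words=4):
    """Crude filter over random Wikipedia article titles, roughly
    mirroring the preprocessing described above (rules are guesses):
    drop disambiguation pages, list pages, and overly long titles."""
    lowered = title.lower()
    if "(disambiguation)" in lowered:
        return False
    if lowered.startswith("list of"):
        return False
    if len(title.split()) > max_words:
        return False
    return True
```

Anything that survives the filter becomes a candidate band name for the HIT.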

The results are here (start registering those myspace pages!). It really looks like people had a lot of fun with this assignment (some more than others). And while there are plenty of differences, there are similarities which would suggest that there’s some sort of common cultural knowledge that goes into choosing a band name. For example, people agree that “Cut the knot” should be a metal band (in actuality it’s a magazine devoted to mathematics). A couple of people thought that Larry Ellison (the CEO of Oracle) should be a jazz/lounge act, while two people thought Brook Lundy (the creator of an e-card humor website) should be a country singer. They’re just names for god’s sake!

Now the thought rolling around my brain is whether or not we can teach machines what people already seem to know about what’s in a name…


Name check

00:00:12 Jonathan Chang: http://googlesystem.blogspot.com/2009/02/download-books-from-google-book-search.html
00:01:44 Jordan Boyd-Graber: interesting
00:01:55 Jonathan Chang: notice the book?
00:02:43 Jordan Boyd-Graber: oh, no 🙂
00:02:56 Jordan Boyd-Graber: the text doesn’t make much sense, though, given the title
00:03:21 Jonathan Chang: it must be a snippet from some corpus
00:04:23 Jordan Boyd-Graber: yes, so it is
00:04:32 Jordan Boyd-Graber: thank you Google book search
00:09:27 Jonathan Chang: that’s really meta
00:09:40 Jonathan Chang: an image of a book quoting some other book
00:11:59 Jordan Boyd-Graber: You should write a blog entry about it with a screenshot of that blog

Done.

Screenshot of a blog.


Applying the Ising model to another data set

Audioscrobbler is a really cool data set from a few years ago; back then, Audioscrobbler had not yet been rolled into last.fm, but it had about the same functionality as last.fm does now. Basically, it’s a little plugin for iTunes et al. that keeps track of all the artists you listen to. The listening habits of several thousand people were collected and distributed under a Creative Commons license.

After some normalization/cleanup, we end up with a set of artists each user is liable to listen to.
This is the sort of co-occurrence statistic which Ising models are good at capturing. The Ising model contains a matrix of parameters which indicate the correlations between artists — that is, the relative likelihood that a given user will end up listening to both artists.

Because this is a rather high-dimensional problem, we can employ some L1 + L2 penalization; what we end up learning is a relatively sparse parameter matrix that is often easier to interpret.
With some magic (cough cough) we can learn this parameter matrix fairly quickly. I thought I’d post some of the correlations between artists here for your {be/a}musement.
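The post doesn’t say what the “magic” is, but a standard way to fit a sparse Ising model is pseudolikelihood: one elastic-net (L1 + L2) logistic regression per node, predicting each artist’s presence from all the others. A sketch with scikit-learn (my choice of fitting procedure, not necessarily the one used here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhoods(X, l1_ratio=0.5, C=1.0):
    """Pseudolikelihood fit of a sparse Ising model: one elastic-net
    logistic regression per artist, predicting its presence in a
    user's listening history from all the other artists.
    X is a binary users-by-artists matrix."""
    n_artists = X.shape[1]
    W = np.zeros((n_artists, n_artists))
    for j in range(n_artists):
        y = X[:, j]
        if y.min() == y.max():  # listened to by everyone or no one
            continue
        others = np.delete(np.arange(n_artists), j)
        clf = LogisticRegression(penalty="elasticnet", solver="saga",
                                 l1_ratio=l1_ratio, C=C, max_iter=5000)
        clf.fit(X[:, others], y)
        W[j, others] = clf.coef_[0]
    return (W + W.T) / 2  # symmetrize the two directed estimates
```

Each entry of `W` then plays the role of a correlation parameter between a pair of artists: the L1 part of the penalty zeroes out most entries, which is what makes the matrix easier to interpret.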

Now the actual parameter matrix consists of several thousand artists. Here, I’m selecting the 10 artists with the highest total correlations. You might say that these are the artists which tug most fiercely on other artists (the most cliquey artists if you want). For each of these 10 artists, I show the 5 most highly correlated artists.
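That selection can be sketched as follows, assuming a dense symmetric parameter matrix `W` and a parallel list of artist names (hypothetical stand-ins for the actual learned model):

```python
import numpy as np

def top_cliques(W, names, n_top=10, n_neighbors=5):
    """Rank artists by total absolute correlation ('cliquiness'),
    then list each leader's most highly correlated neighbors.
    W is assumed to be the symmetric artist-by-artist matrix."""
    strength = np.abs(W).sum(axis=1)
    leaders = np.argsort(strength)[::-1][:n_top]
    table = {}
    for i in leaders:
        ranked = [j for j in np.argsort(W[i])[::-1] if j != i]
        table[names[i]] = [names[j] for j in ranked[:n_neighbors]]
    return table
```

Running this over the learned matrix is what produces the artist table below.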

The results make pretty good sense; it’s actually kind of disturbing how predictable people’s musical tastes are. And for some reason the main cliques at the top of the list are all either metal bands or the sort of indie bands likely to populate OC soundtracks =). I should point out that if you go further down the list you eventually find a few other cliques such as trip hop (Portishead, Massive Attack, Lamb, Tricky, et al. [note to self: how cool would “et al.” be as a band name?]), 80s rock with remarkable staying power (Aerosmith, Bon Jovi, Guns N’ Roses), wuss rock (Counting Crows, DMB, Goo Goo Dolls), and just plain bad music (3DD, Hoobastank, Staind, Nickelback).

Artist: most correlated artists
Metallica: Iron Maiden, Megadeth, Pantera, Slayer, Nightwish
In Flames: Dark Tranquillity, Soilwork, Children of Bodom, Arch Enemy, Dimmu Borgir
The Arcade Fire: The Fiery Furnaces, Broken Social Scene, The Go! Team, Bloc Party, Stars
Nightwish: Within Temptation, Sonata Arctica, Blind Guardian, Stratovarius, Therion
Rammstein: Nightwish, Apocalyptica, KoЯn, Marilyn Manson, Metallica
Belle and Sebastian: The Magnetic Fields, Neutral Milk Hotel, Yo La Tengo, Elliott Smith, Camera Obscura
Iron Maiden: Judas Priest, Iced Earth, Helloween, Manowar, Bruce Dickinson
Elliott Smith: Iron & Wine, The Decemberists, Bright Eyes, Sufjan Stevens, Belle and Sebastian
Bright Eyes: Rilo Kiley, Death Cab for Cutie, Desaparecidos, Cursive, The Good Life
Death Cab for Cutie: The Postal Service, Bright Eyes, The Shins, Rilo Kiley, Cursive
