I’ve been doing some work on the topic, along with Lars, Cameron, and Itamar. Read more at http://www.facebook.com/note.php?note_id=205925658858
For those of you in Vancouver right now, here’s a shameless plug for our NIPS talk on interpreting topic models, which is happening at 4:10. Hope to see you there. And to whet your appetite, here’s a picture:
To find out what it means, come to the talk! And don’t forget the workshop on Friday =).
Thanks to all of you who’ve expressed interest in and support for our recent paper Reading Tea Leaves: How Humans Interpret Topic Models, which was co-authored with Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David Blei. Many people (myself included) either implicitly or explicitly assume that topic models can find meaningful latent spaces with semantically coherent topics. The goal of this paper was to put this assumption to the test by gathering lots of human responses to some tasks we devised. We got some surprising and interesting results: held-out likelihood is often not a good proxy for interpretability. You’ll have to read the paper for the details, but I’ll just leave you with a teaser plot below.
Furthermore, Jordan has worked hard prepping some of our data for public release. You can find that stuff here.
Your favorite package for running topic models in R has been updated! This one not only has bugfixes and more utility functions, it also has two new models:
- The Networks Uncovered by Bayesian Inference (NUBBI) model, which discovers connections between entities in free text (run demo(nubbi); note that for licensing reasons, I could not include the data for this demo in the package);
- the Relational Topic Model (RTM) for discovering patterns which account for both document content and connections between documents (run
And because it’s on CRAN, everyone (including Windows users) can install by simply executing install.packages("lda"). Please install, play with it, and let me know if you find any bugs.
Dave and I were recently talking about Asuncion et al.’s wonderful recent paper “On Smoothing and Inference for Topic Models.” One thing that caught our eye was the CVB0 inference method for topic models, which is described as a first-order approximation of the collapsed variational Bayes approach. The odd thing is that this first-order approximation performs better than other, more “principled” approaches. I want to try to understand why. Here’s my current less-than-satisfactory stab:
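Let me just lay out the problem. Suppose I want to approximate the marginal posterior over topic assignments $z_i$ in a topic model given the observed words $w$, $p(z_i \mid w)$. We can expand this probability using an expectation,
$$p(z_i \mid w) = \mathbb{E}_{p(z_{\neg i} \mid w)}\left[ p(z_i \mid z_{\neg i}, w) \right].$$
We can’t compute the expectation analytically, so we must turn to an approximate inference technique. One technique is to run a Gibbs sampler, whence we get samples $z^{(1)}, \ldots, z^{(S)}$ from the joint posterior over topic assignments $p(z \mid w)$. Then using these samples we approximate the expectation,
$$p(z_i \mid w) \approx \frac{1}{S} \sum_{s=1}^{S} p(z_i \mid z^{(s)}_{\neg i}, w).$$
In the case of LDA, this conditional probability is proportional to
$$p(z_i = k \mid z_{\neg i}, w) \propto \left(N^{\neg i}_{d_i k} + \alpha\right) \frac{N^{\neg i}_{w_i k} + \eta}{N^{\neg i}_{k} + V\eta},$$
where
- $\alpha$ is the Dirichlet hyperparameter for topic proportion vectors;
- $\eta$ is the Dirichlet hyperparameter for the topic multinomials;
- $N^{\neg i}_{w_i k}$ is the number of times topic $k$ has been assigned to word $w_i$;
- $N^{\neg i}_{k}$ is the number of times topic $k$ has been assigned overall;
- $N^{\neg i}_{d_i k}$ is the number of times topic $k$ has been assigned in document $d_i$.
Note that the above counts do not include the current value of $z_i$ (hence the $\neg i$ superscript), and that $V$ is the size of the vocabulary.
Instead of the Gibbs sampling approach, we could also approximate the expectation by taking a first-order approximation (which we denote $\hat{p}$),
$$\hat{p}(z_i = k \mid w) \propto \left(\mathbb{E}[N^{\neg i}_{d_i k}] + \alpha\right) \frac{\mathbb{E}[N^{\neg i}_{w_i k}] + \eta}{\mathbb{E}[N^{\neg i}_{k}] + V\eta},$$
where the expectations are taken with respect to $p(z_{\neg i} \mid w)$. Because the terms in the expectations are simple sums, they can be computed solely as functions of the marginals $p(z_j \mid w)$, $j \neq i$. For example,
$$\mathbb{E}[N^{\neg i}_{d_i k}] = \sum_{j \neq i,\; d_j = d_i} p(z_j = k \mid w).$$
Plugging the approximation $\hat{p}$ in for these marginals gives a set of fixed-point equations.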
Thus, the solution to this approximation is exactly the CVB0 technique described in the paper. Note that I never directly introduced the concept of a variational distribution! CVB0 is simply a first-order approximation to the true expectations; in contrast, the second-order CVB approximation is an approximation of the variational expectations. So maybe that’s the answer to the puzzle: sometimes a first-order approximation to the true value is better than a second-order approximation to a surrogate objective.
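To make the update concrete, here is a minimal Python sketch of a CVB0-style fixed-point iteration for LDA. It is my own toy illustration, not code from the paper or from the R package: the function name `cvb0`, the random initialisation, the fixed number of sweeps, and the tiny corpus are all assumptions. Each token keeps a distribution over topics, and the expected counts are just sums of those distributions with the current token's contribution subtracted out.

```python
import random


def cvb0(docs, num_topics, vocab_size, alpha=0.1, eta=0.1, num_sweeps=50, seed=0):
    """Toy CVB0-style iteration. `docs` is a list of lists of word ids.

    Returns gamma, where gamma[d][n][k] approximates the marginal
    probability that token n of document d is assigned to topic k.
    """
    rng = random.Random(seed)

    # Start every token with a random, normalised distribution over topics.
    gamma = []
    for doc in docs:
        rows = []
        for _ in doc:
            row = [rng.random() + 0.1 for _ in range(num_topics)]
            total = sum(row)
            rows.append([g / total for g in row])
        gamma.append(rows)

    # Expected counts are sums of the per-token distributions.
    n_dk = [[0.0] * num_topics for _ in docs]               # document-topic
    n_wk = [[0.0] * num_topics for _ in range(vocab_size)]  # word-topic
    n_k = [0.0] * num_topics                                # topic totals
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            for k in range(num_topics):
                g = gamma[d][n][k]
                n_dk[d][k] += g
                n_wk[w][k] += g
                n_k[k] += g

    for _ in range(num_sweeps):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                old = gamma[d][n]
                new = []
                for k in range(num_topics):
                    # Expected counts excluding the current token.
                    e_dk = n_dk[d][k] - old[k]
                    e_wk = n_wk[w][k] - old[k]
                    e_k = n_k[k] - old[k]
                    new.append((e_dk + alpha) * (e_wk + eta)
                               / (e_k + vocab_size * eta))
                total = sum(new)
                new = [v / total for v in new]
                # Fold the updated distribution back into the counts.
                for k in range(num_topics):
                    delta = new[k] - old[k]
                    n_dk[d][k] += delta
                    n_wk[w][k] += delta
                    n_k[k] += delta
                gamma[d][n] = new
    return gamma


# Hypothetical toy corpus over a 4-word vocabulary.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
gamma = cvb0(docs, num_topics=2, vocab_size=4)
```

Note that the inner formula has the same shape as the Gibbs conditional; the only change is that hard counts are replaced by expected counts.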
Does anyone have any other explanations?
Dave gently reminded me that properly assessing convergence of our models is important and that just running a sampler for N iterations is unsatisfactory. I agree wholeheartedly. As a first step, the collapsed Gibbs sampler in the R LDA package can now optionally report the log likelihood (to within a constant). For example, we can rerun the model fit in demo(lda) but with an extra flag set:

    result <- lda.collapsed.gibbs.sampler(cora.documents,
                                          K,           ## Num clusters
                                          cora.vocab,
                                          25,          ## Num iterations
                                          0.1,
                                          0.1,
                                          compute.log.likelihood=TRUE)

Using the now-available variable result$log.likelihoods, we can plot the progress of the sampler versus iteration:
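Once a log likelihood trace is available, one crude way to go beyond "run for N iterations" is to stop when the trace flattens out. The Python sketch below is only one possible heuristic, not something the package does: the function name `has_converged`, the window size, and the tolerance are all my assumptions, and it expects the trace to have been exported as a plain list of numbers, one per iteration. The example trace is made up purely for illustration.

```python
def has_converged(log_likelihoods, window=5, rel_tol=1e-3):
    """Compare the mean of the last `window` values with the mean of the
    `window` values before them; report convergence when the relative
    change is at most `rel_tol`.
    """
    if len(log_likelihoods) < 2 * window:
        return False  # not enough iterations to compare two windows
    recent = log_likelihoods[-window:]
    previous = log_likelihoods[-2 * window:-window]
    mean_recent = sum(recent) / window
    mean_previous = sum(previous) / window
    return abs(mean_recent - mean_previous) <= rel_tol * abs(mean_previous)


# Made-up trace for illustration only; not output from a real sampler run.
trace = [-2000.0, -1800.0, -1650.0, -1550.0, -1500.0,
         -1480.0, -1470.0, -1465.0, -1462.0, -1461.0]
still_moving = not has_converged(trace)
```

A flat trace is necessary rather than sufficient evidence of convergence, so this is best treated as a sanity check alongside the plot.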
Grab it while it’s hot: http://www.cs.princeton.edu/~jcone/lda_1.0.1.tar.gz.
The other day I paid a nice visit to Alex and Yan. We got around to talking about how bit.ly (a link shortener) can be used to track things on twitter. Anyhow, I’m sure they will blow you away with their analysis soon enough, but I thought I’d post some results from a really simple analysis.
The cool thing about bit.ly is that there’s an API that allows us to find out how many clickthrus there were on each link. This makes basic website analytics available to everyone and gives us the ability to start looking at what drives traffic. So we can try to figure out what motivates people to click on links posted on twitter: content, network, or something else?
Here’s what I did: I took the last 3200 tweets by theonion and extracted all the bit.ly links therein (there were about 1200). I then got the number of clicks for each of the links as well as relevant metadata through the bit.ly API. There’s a tiny bit of noise there, but here’s what it looks like when I plot the number of clicks (as measured by bit.ly) versus the date when the link was tweeted:
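The link-extraction step is simple enough to sketch. The Python below is a rough illustration rather than the script I actually ran: the regular expression, the function name `extract_bitly_links`, the choice to drop duplicate links, and the sample tweets are all assumptions, and it deliberately stops short of calling the bit.ly API.

```python
import re

# Assumed pattern: bit.ly short links are a hash of letters, digits, "_" or "-".
BITLY_LINK = re.compile(r"https?://bit\.ly/[A-Za-z0-9_-]+")


def extract_bitly_links(tweets):
    """Return the distinct bit.ly URLs found in the tweet texts,
    in the order they are first seen.
    """
    seen = set()
    links = []
    for text in tweets:
        for url in BITLY_LINK.findall(text):
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


# Made-up tweets for illustration only.
sample_tweets = [
    "Area man reads http://bit.ly/abc123 again",
    "No link in this one",
    "Two links: http://bit.ly/abc123 and https://bit.ly/XyZ-9.",
]
links = extract_bitly_links(sample_tweets)
```

The click counts and metadata would then be looked up per link through the API.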
You can see how phenomenally theonion’s twitter account has taken off in the last year, eventually reaching this weird cyclical pattern, a valley of which we currently seem to be in. (I don’t really have a good explanation for why that pattern is occurring.) But what’s also phenomenal is how closely clicks tend to track the mean. That is, there isn’t a whole lot of variance at any given time. I’d guess that there is a set of regular readers who click on pretty much everything that theonion posts. And while there is an ebb and flow of regular readers, it’s not like within some time slice there are a few articles which really take off (“go viral”) and a bunch which languish. This is totally strange to me; my intuition based on diggs is that there’d be a Pólya-urn, rich-get-richer type of distribution for link clickthrus, but there doesn’t appear to be.
This is also strange to me because followers of this account are basically treating it like an RSS feed of onion articles, which makes me wonder: why are they using twitter at all?
I broke down the data a few other ways to see if I could tease out other trends. I tried breaking it down by time of day. And as expected, posting stops at night and begins to pick up again at noon GMT = 8 am Eastern. But there isn’t a huge amount of variation based on when the urls get tweeted: once it gets into people’s queue it seems that they’ll get around to it eventually.
Finally, I tried breaking it down by day of the week. Not much news to report here. There are fewer tweets on Saturday and Sunday (although it sort of picked up on those days during July). And there isn’t any significant difference in terms of number of clickthrus per link on any given day of the week.
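Both breakdowns (time of day and day of week) amount to bucketing links by a function of the tweet time and averaging clicks per bucket. Here is a small Python sketch of that idea; the function names, the record layout of (timezone-aware datetime, clicks) pairs, and the sample records are all hypothetical and only meant to show the shape of the computation.

```python
from collections import defaultdict
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_clicks_by(records, key):
    """Group (tweeted_at, clicks) records by key(tweeted_at).

    Returns {bucket: (number_of_links, mean_clicks_per_link)}.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [link count, click sum]
    for tweeted_at, clicks in records:
        entry = totals[key(tweeted_at)]
        entry[0] += 1
        entry[1] += clicks
    return {bucket: (count, total / count)
            for bucket, (count, total) in totals.items()}


def by_weekday(tweeted_at):
    """Bucket by day of the week (in the datetime's own timezone)."""
    return WEEKDAYS[tweeted_at.weekday()]


def by_hour_gmt(tweeted_at):
    """Bucket by hour of day in GMT; expects a timezone-aware datetime."""
    return tweeted_at.astimezone(timezone.utc).hour


# Made-up records for illustration only.
records = [
    (datetime(2009, 7, 6, 12, 0, tzinfo=timezone.utc), 100),
    (datetime(2009, 7, 6, 15, 0, tzinfo=timezone.utc), 300),
    (datetime(2009, 7, 7, 12, 0, tzinfo=timezone.utc), 200),
]
per_weekday = mean_clicks_by(records, by_weekday)
per_hour = mean_clicks_by(records, by_hour_gmt)
```

Keeping the link count next to the mean matters here, since days with few tweets (like the weekend) give noisier averages.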
So there you have it. theonion has basically co-opted twitter as a news feed. And its readers faithfully read (or at least click on) the posted bit.ly links and any content or network effects seem to average out in the end.
Major thanks to Eytan for introducing me to the bit.ly API and lots of pro-tips on navigating/understanding the twitterverse.