Data for some topic model tasseography

Thanks to all of you who’ve expressed interest in and support for our recent paper Reading Tea Leaves: How Humans Interpret Topic Models, co-authored with Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David Blei. Many people (myself included) assume, implicitly or explicitly, that topic models can find meaningful latent spaces with semantically coherent topics. The goal of this paper was to put that assumption to the test by gathering lots of human responses to some tasks we devised. We got some surprising and interesting results: held-out likelihood is often not a good proxy for interpretability. You’ll have to read the paper for the details, but I’ll leave you with a teaser plot below.

Furthermore, Jordan has worked hard to prepare some of our data for public release. You can find that stuff here.




2 responses to “Data for some topic model tasseography”

  1. Laurens van der Maaten

    I guess you could also try to compute word similarities based on the topics, and compare those to human word associations (using the USF free association data) to get an idea of the quality of the topics…
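
    The suggestion above could be sketched roughly as follows. This is a minimal illustration, not anything from the paper: it assumes you have a topics-by-vocabulary probability matrix, represents each word by its column, scores word pairs by cosine similarity, and rank-correlates those scores with (here, made-up) human association strengths of the kind the USF norms provide.

    ```python
    # Hypothetical sketch: compare topic-based word similarities to human
    # word-association strengths via rank correlation.
    import numpy as np
    from scipy.stats import spearmanr

    def topic_word_similarity(topic_word, i, j):
        """Cosine similarity between words i and j in topic space,
        where topic_word is a (num_topics, vocab_size) matrix."""
        u, v = topic_word[:, i], topic_word[:, j]
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy model: 2 topics over a 4-word vocabulary (each row sums to 1).
    topic_word = np.array([
        [0.50, 0.40, 0.05, 0.05],   # topic 0 favors words 0 and 1
        [0.05, 0.05, 0.50, 0.40],   # topic 1 favors words 2 and 3
    ])

    pairs = [(0, 1), (0, 2), (2, 3)]
    model_scores = [topic_word_similarity(topic_word, i, j) for i, j in pairs]
    human_scores = [0.9, 0.1, 0.8]  # hypothetical association strengths

    # Higher rho suggests the topics group words the way humans do.
    rho, _ = spearmanr(model_scores, human_scores)
    ```

    In this toy setup, words favored by the same topic (0 and 1) come out more similar than words split across topics (0 and 2), so the model and human rankings agree and rho is high.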
