LDA for the masses (who use R)

Long time no post. I’ve been busy with lots of stuff: writing my thesis, renaming this blog to pleasescoopme.com, and other stuff which I’ll post soon enough. Another thing I’ve been working on is an R package that implements collapsed Gibbs samplers (written in C) for some of the models I’ve been using: latent Dirichlet allocation (LDA), the mixed-membership stochastic blockmodel (MMSB), and supervised LDA (sLDA). It’s still somewhat experimental but I’ve found it to be immensely useful already. Here some included demos to show off what you can already do out of the box (plots made with the fantastic ggplot2 package):

You too can make all these pretty pictures by downloading the package here. Then simply run ‘R CMD INSTALL lda_1.0.tar.gz’ to install the package and you’ll be ready to go! All of you out there who work with these models, or want to start working with these models, give it a shot and gimme any feedback you have. I hope to improve things and add more models in some upcoming releases.

7 Comments

Filed under Uncategorized

Another simple parameter estimation method for LDA

In an earlier post, I looked at a couple of different ways to estimate the parameters of a latent Dirichlet allocation (LDA) model, namely the topic mixing proportions \theta and the multinomial topic distributions \beta. Continuing along those lines, here I describe yet another way of doing so. This way turns out to be much faster, but at the cost of accuracy.

As I mentioned in the previous post, the objective we try to maximize is the log likelihood of the data, namely \sum_d \sum_{w \in d} \log \theta_d^T\beta_{w}. The trick I want to use here is the concave bound, which lets me rewrite this function as \sum_d \sum_{w \in d} \theta_d^T \log \beta_{w}. Solving this for \beta_{w} turns out to be easy: \beta_{w'} \propto \sum_d \sum_{w \in d} \mathbf{1}(w' = w) \theta_d + \eta, where \eta is a Dirichlet hyperparameter on \beta.

Now this bound isn’t as nice for solving for \theta_d. The solution will always be degenerate at a corner of the simplex. While this is not necessarily bad for modeling likelihood, it does mean that we won’t be able to get a handle on mixed-membership behavior. An alternative is to optimize a slightly different bound: playing the same trick as before but with different variables yields: \sum_d \sum_{w \in d} \frac{\beta_{w}}{|\beta_w|_1} \log \theta^d + \log |\beta_w|_1. This has a solution of the form: \theta_d \propto \sum_{w \in d} \frac{\beta_{w}}{|\beta_w|_1} + \alpha, where \alpha is a Dirichlet hyperpamater on \theta_d.

So in essence what I’m proposing here is alternating between optimizing \beta and \theta, something which has a very EM flavor, except that each of the optimization steps really optimizes a different bound. Notice here that the “E-step” which optimizes \theta is completely non-iterative — it’s just a simple sum-then-normalize — unlike variational inference for LDA which requires iterative estimation of the variational parameters for each E-step. This means that this simple procedure is liable to be much faster. I didn’t do any formal speed tests, but for the results below which were implemented in R, using this procedure to estimate the parameters of the model was noticeably faster than reading in the data! The same is decidedly NOT true for variational EM.

Ok, but how bad is the bound for getting us the true parameters? This bad. Here I’m showing you the barplots of the log KL distance between the estimated values of \theta and the true values of \theta used to generate this synthetic data set. Lower is better. “LDA” uses the mean estimates derived from the lda-c package. “Opti” uses the optimization procedure I described in the previous post (as mentioned there, “Opti” compares favorably with “LDA”). “Fast” uses the method I described here. “Fast” is rather worse than the other methods unfortunately. When looking at the results I noticed that “Fast” was considerably “smoother” than the other methods, i.e., “Fast” produces estimates with higher entropy. As an ad hoc fix, I also computed “Fast^2″ which consists of the same estimates as “Fast”, but squared to yield sharper results. That one actually does better and is within shooting distance of the other methods, but it’s not quite there yet.

Anyways, there are a lot of scenarios where speed matters and in those cases I suspect “Fast” will turn out to be accurate enough and fast enough to be of practical use. “Fast^2″ does better on this case, but I have no real understanding of why that’s so. Maybe it’s a fluke or maybe it’s the solution to a more accurate estimate of the log likelihood. And perhaps understanding why it’s doing better will help derive an even better way of estimating the parameters. Maybe one of you can enlighten me.

Leave a Comment

Filed under Uncategorized

How far apart are you and I?

As a result of a conversation I recently had, I was curious to know how far two people plucked at random from the population of the world/US would be. I used the data from the R maps package and figured it out.

Answer? Farther (or further?) than I would’ve thought. Assuming the readers of this blog are distributed like everyone else, then you and I are probably around 5000 miles apart.

2 Comments

Filed under Uncategorized

I shall be telling this with a sigh somewhere ages and ages hence

Warning: If you have a vested interest in what I do after I graduate (you know who you are; if you’re not sure if this applies to you, then it doesn’t) or you don’t have the stomach for the inherently impolitic nature of job-search aporia, you should probably stop reading.


As some of you know, I’ve been wrestling with some decisions over what I want to do once I graduate (in the near future, knock on wood). And I’ve had a really hard time coming to any sort of definitive conclusion. I know a lot of you also had/have the same decisions to make.

The first order decision is whether to go the route of industry or academia. I’m really interested in having the freedom and resources to work on interesting problems; I’ve fielded arguments from both sides on whether industry or academia is best able to realize this. On the one hand, industry generally has a lot of resources (be that data or computers). And some places are engaging on problems which I think are of real research interest. But in the end you have to sing for your supper, so your eye is typically always towards some product (vague though it may be).

Academia on the other hand is relatively unfettered. It is true, however, that you are subject to the whims of inconstant moons, grants, committees, etc. You’ve got to hustle and sometimes that hustle may entail compromise on the research front. Academics also might have fewer resources on average, but they’re no slouches either. On the balance, academia provides an environment in which the talismans of research, publications, are given priority. Whether these count as progress is up for debate, I suppose, but there’s no denying that there’s lots of interesting stuff in those pages.

The second order decisions are on where one should go in academia/industry. What should one look for in an academic position? In an industry position? A smart boss? A famous boss? Smart peers? Smart underlings? Other interactions with academia/industry? A secure role? A flexible role? Other? All of the above? None of the above?

Finally, while it may seem from the above discussion that options can only cause headache, I still think options are good. So what track provides the most opportunities for change in the future? I’ve heard it argued that one should just go to academia; you can always leave. On the other hand, it’s also been argued that silicon valley changes much more quickly than the lumbering giant that is academia; you should take those opportunities when they appear. If I go to academia will I wall myself off in an ivory tower? If I go to industry, will academia then shun me?

So, do you have advice/musings for me? Feel free to comment below, or email me if it’s personal. And contact me if you want to take a more specific survey about these questions =).


Update:

I did an informal poll of some people I know. They came out on the side of going into academia. Out of curiosity, I broke down the results according to whether the survey respondent was currently in academia or industry (there’s some fuzziness in how I categorized people: e.g., grad students were counted as academia). The result of this is here. As you can see, if I were to only query industry people, then industry would win out on my poll. People currently in academia on the other hand are overwhelmingly tilted towards academia. Go figure.


Update 2:

Edo pointed me to this helpful link.

1 Comment

Filed under Uncategorized

Fun with names

In line with a previous post I decided to have some more fun with names.

I used the census names data to generate 200 names by taking a random first name and a random last name and combining them. The first names were chosen 50/50 from the male and female lists. All the random choices were done uniformly so that I’d get “interesting” names instead of a bunch of Dave Joneses.

For each of the names I asked five mechanical turkers to judge the person to whom the name belongs on six axes:

  • Hero or villain?
  • Trustworthy or untrustworthy?
  • Attractive or unattractive?
  • Sophisticated or naive?
  • Powerful or weak?
  • Outgoing or reserved?

There were a few methodological issues that I need to sort out at some point. And I’m going to keep the full results secret for a little longer but I thought I’d share a few quick aggregate results:

  • Most heroic – Kurt Rollin
  • Most Villainous – Toney Prus
  • Most outgoing – Jeanie Bainard
  • Most reserved – Jesusita Fondaw
  • Most trustworthy – Laureen Hodgen
  • Most untrustworthy – Robbie Holck
  • Most sophisticated – Cecille Legat
  • Most naive – Ramon Smialowski
  • Most powerful – Rich Keaty
  • Most weak – Bonnie Wurz
  • Most attractive – Whitney Okano
  • Least attractive – Freddy Doughtery

How do you guys and gals read these names (and other ones)? Let me know in the comments!

8 Comments

Filed under Uncategorized

Word counts vs. word presence for LDA

Every now and then someone (you know who you are) asks if the feature vectors one passes into LDA should be vectors of word counts (i.e., vectors of non-negative integers) or vectors of word presence/absence (i.e., vectors of binary values). Now the former gives strictly more information so the short answer is that you should always use word counts when available. But in my experience the difference is less than you might think.

To that end, I decided to put together a little side-by-side. Today’s corpus is Cora abstracts (with stopword and infrequent word filtering). I’m running everything through stock LDA-C with alpha set to 1.0 and K set to 10. Let me just preface what follows with the warning that your mileage may vary (and if it does, let me know!).

First up, let’s look at the topics (or specifically, the top 10 words in each topic) produced for the two feature representations. No earth-shattering differences here and absent an objective way to measure the quality of a topic, I’d be hard-pressed to say that one produced results any better than the other.

  • Word counts

    Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
    learning problem performance model research
    control problems results models report
    state search classification data part
    reinforcement method paper bayesian technical
    paper selection methods analysis paper
    dynamic algorithms data probability grant
    system solution method markov university
    systems space classifier time supported
    simulation optimization parallel distribution science
    robot test application methods artificial
    Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
    network learning algorithm genetic knowledge
    neural decision error model system
    networks paper number evolutionary design
    input examples show fitness reasoning
    training features algorithms visual case
    weights algorithms class results theory
    recurrent algorithm learning population paper
    hidden rules results evolution systems
    trained approach function crossover cases
    output tree model strategies approach
  • Word presence

    Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
    system research performance learning algorithms
    behavior report paper data genetic
    systems abstract approach present search
    complex technical parallel classification algorithm
    design part proposed algorithm problem
    model paper implementation show results
    development university results training optimal
    paper science level accuracy show
    computational supported memory results problems
    environment computer multiple decision find
    Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
    method bayesian neural function learning
    methods reasoning network distribution paper
    problem models networks algorithm learn
    applied model input general problem
    problems cases learning class system
    time case model model reinforcement
    demonstrate paper hidden linear approach
    technique framework training functions knowledge
    large markov trained show learned
    learning knowledge weights results programming

Next, let’s look at the entropy of the topic multinomials to get an idea of what the general shape of these topic distributions look like. Here I’ve computed the topic entropies for both sides of the comparison; I then sort them by entropy (silly exchangeability!). Finally, I’m showing the scatter plot of the entropy of the word-counts topics vs. the entropy of the word-presence topics. The blue line runs along the diagonal. The differences are not too phenomenal; as expected binary features mean higher entropy since binary features amount to a smaller number of observations with which to overcome smoothing. But in the end these are really just 3-5% differences.

Next up, let’s look at convergence of the likelihood bound (these aren’t strictly comparable in any meaningful way since of course the likelihood is computed over two different data representations). Here I’m showing the variational bound on the log likelihood as a function of iteration. In black, word counts are used as features; in red, strictly binary features are used. Because the likelihood scales are different, I’ve rescaled the two curves so that they’re approximately equal (the two axes reflect the two original scales). As with the other evaluations, the shape of the two curves is very similar.

Finally, let’s look at the entropy of the per-document topic-proportions to get an idea of how topics are assigned in each case. As with the previous scatter plot, this plot shows the entropy of each document’s topic proportions under word counts vs. word presence. As before, less data in the binary feature case means generally higher entropy. But the differences are more notable in this case. While for many documents the differences are within 10% for some the difference is as large as 100%. This is most likely due to the fact that in some cases (e.g., when the number of distinct words in a document is small), feature binarization changes the character and amount of data in the document by quite a lot.

I think it is in these corner cases that using full word counts data is likely to be most useful. But overall I think the differences are not that great and not worth expending too many grey cells over.

6 Comments

Filed under Uncategorized

Ignoring outliers during boosting

I had a discussion the other day about using the weights returned by boosting to do outlier detection. That is to say, the examples with high weight are hard to fit and are therefore likely to be mislabeled/outliers. So why not find some way of automatically ignoring these instances?

I know that BBM already magically does something along these lines, but I was wondering if there were some simple fudges for AdaBoost which would also do this. So I decided to run a quick experiment to see what would work. To set up the data, I randomly drew 100 points from \mathbb{R}^2 and let these be the instances. The binary labels for the instances were +1 iff the point fell within any of K circles; that is, y_j = +1 if and only if |x_j - c_i|_2^2 < r_i^2 for any i=1\ldots K. Just to confound the problem a little bit I then randomly flipped the labels on 10% of the points. The first plate of each of the plots below shows this input data set with the flipped (read: noisy/mislabeled) instances in red and the K=2 circles in blue.

For the weak hypotheses I chose the set of circles (of arbitrary radius) which predict either +1 or -1 in their interior and abstain (i.e., predict 0) on their exterior. Since the true labels can be represented as a linear combination of weak hypotheses in principle it should be easy for boosting to recover drive the training down to 0 were it not for the noisy points. Typically AdaBoost will spend a lot of effort trying to fix up these points with little benefit with respect to generalization error.

Below I describe a couple of methods for finding and ignoring these points during the boosting process. For each I include a plot of the set of weak hypotheses after each iteration (blue circles), along with the predictions made by the combined hypothesis on each of the training instances (purple = +, green = -). Below each plot I also indicate the error on the entire training set (first number) along with the error on the non-noisy, correctly labeled points (second number) since we presumably don’t want to match the noisy points too closely.

  • Stock AdaBoost – Nothing special here. This is just plain-jane AdaBoost. It does a pretty good job of driving down the training error (0.11 after 11 iterations). But the hypothesis found by AdaBoost is a rather complex combination of many circles that does not bear that much resemblance to the generating hypothesis. Also note that when the noise is removed, the error at the end is 0.07. Not so good when considering that this number could be zero after two or three iterations.
  • AdaBoost /w thresholding – Here I again run stock AdaBoost except that the distribution shown to the weak learner ignores all points whose weight exceeds some threshold. Precisely, let D denote the distribution over training examples which AdaBoost maintains internally. Then the distribution the weak learner sees is D^*_i \propto \mathbf{1}(D_i \le \theta)D_i. Here, the combined hypothesis is much simpler consisting essentially of just four circles. The predictions made by this technique really aren’t any worse than regular AdaBoost (0.12 vs 0.11 on the full training set, 0.07 in both cases on the de-noised set). So it seems like doing this is not a completely terrible idea; you make about the same predictions but with much simpler rules.
  • AdaBoost /w trimming – Here, once again, we run AdaBoost except this time the distribution shown to the weak learner excludes the 10% of points with the highest mass. That is, let R_i = \sum_{j=1}^M \mathbf{1}(D_i > D_j) be the rank of a training example, where M denotes the total number of training examples. Then the distribution shown to the weak learner is D^*_i \propto \mathbf{1}(R_i \le \theta)D_i (with some fudging I won’t get into here to break ties). The rules found by this method are unfortunately not that uncomplicated. And while some of the circles are vaguely in the spot they should be, in large part they’re all over the place. But what’s interesting here is that while the error on the training set is about the same as with the other techniques (0.12), the error on just the non-noisy ones is (0.03). So in fact, this method does the best job, in some sense, of ignoring the points which are noisy.

Practically speaking, I think I’d go with AdaBoost /w trimming as an easy approach to getting rid of label noise. Has anyone out there tried this before?

5 Comments

Filed under Uncategorized