Daily Archives: May 6, 2009

Word counts vs. word presence for LDA

Every now and then someone (you know who you are) asks whether the feature vectors one passes into LDA should be vectors of word counts (i.e., vectors of non-negative integers) or vectors of word presence/absence (i.e., vectors of binary values). The former carries strictly more information, so the short answer is that you should always use word counts when they're available. But in my experience the difference is smaller than you might think.
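To make the distinction concrete, here's a minimal sketch of the two representations (the numbers are made up for illustration, not Cora data):

```python
import numpy as np

# Toy document-term matrix: 3 documents x 4 vocabulary words.
# (Illustrative values only, not from the Cora corpus.)
counts = np.array([
    [3, 0, 1, 0],
    [0, 2, 2, 1],
    [5, 0, 0, 0],
])

# Word-count features: use `counts` as-is.
# Word-presence features: clamp every positive count down to 1.
presence = (counts > 0).astype(int)

print(presence)
```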

To that end, I decided to put together a little side-by-side. Today’s corpus is Cora abstracts (with stopword and infrequent word filtering). I’m running everything through stock LDA-C with alpha set to 1.0 and K set to 10. Let me just preface what follows with the warning that your mileage may vary (and if it does, let me know!).
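For concreteness: LDA-C's data format, as I recall, puts one document per line as `[M] [term]:[count] ...`, where M is the number of unique terms. Producing the binary variant is then just a matter of clamping every count to 1 (a sketch; the helper name is mine):

```python
def binarize_ldac_line(line: str) -> str:
    """Convert one line of LDA-C's '[M] term:count ...' format to
    word-presence features by forcing every count to 1.
    M (the number of unique terms) is unchanged by binarization."""
    fields = line.split()
    m, pairs = fields[0], fields[1:]
    return " ".join([m] + [p.split(":")[0] + ":1" for p in pairs])

# Example: a document with term 4 (x3), term 7 (x1), term 9 (x2).
print(binarize_ldac_line("3 4:3 7:1 9:2"))  # -> "3 4:1 7:1 9:1"
```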

First up, let’s look at the topics (or specifically, the top 10 words in each topic) produced for the two feature representations. No earth-shattering differences here, and absent an objective way to measure the quality of a topic, I’d be hard-pressed to say that one representation produced better results than the other.

  • Word counts

    Topic 1        Topic 2       Topic 3         Topic 4       Topic 5
    learning       problem       performance     model         research
    control        problems      results         models        report
    state          search        classification  data          part
    reinforcement  method        paper           bayesian      technical
    paper          selection     methods         analysis      paper
    dynamic        algorithms    data            probability   grant
    system         solution      method          markov        university
    systems        space         classifier      time          supported
    simulation     optimization  parallel        distribution  science
    robot          test          application     methods       artificial

    Topic 6    Topic 7     Topic 8     Topic 9       Topic 10
    network    learning    algorithm   genetic       knowledge
    neural     decision    error       model         system
    networks   paper       number      evolutionary  design
    input      examples    show        fitness       reasoning
    training   features    algorithms  visual        case
    weights    algorithms  class       results       theory
    recurrent  algorithm   learning    population    paper
    hidden     rules       results     evolution     systems
    trained    approach    function    crossover     cases
    output     tree        model       strategies    approach
  • Word presence

    Topic 1        Topic 2     Topic 3         Topic 4         Topic 5
    system         research    performance     learning        algorithms
    behavior       report      paper           data            genetic
    systems        abstract    approach        present         search
    complex        technical   parallel        classification  algorithm
    design         part        proposed        algorithm       problem
    model          paper       implementation  show            results
    development    university  results         training        optimal
    paper          science     level           accuracy        show
    computational  supported   memory          results         problems
    environment    computer    multiple        decision        find

    Topic 6      Topic 7    Topic 8   Topic 9       Topic 10
    method       bayesian   neural    function      learning
    methods      reasoning  network   distribution  paper
    problem      models     networks  algorithm     learn
    applied      model      input     general       problem
    problems     cases      learning  class         system
    time         case       model     model         reinforcement
    demonstrate  paper      hidden    linear        approach
    technique    framework  training  functions     knowledge
    large        markov     trained   show          learned
    learning     knowledge  weights   results       programming

Next, let’s look at the entropy of the topic multinomials to get an idea of the general shape of these topic distributions. Here I’ve computed the topic entropies for both sides of the comparison; I then sort them by entropy, since topics aren’t identifiable across runs (silly exchangeability!). Finally, I’m showing the scatter plot of the entropy of the word-count topics vs. the entropy of the word-presence topics, with the blue line running along the diagonal. The differences are not too phenomenal; as expected, binary features mean higher entropy, since binarization amounts to a smaller number of observations with which to overcome smoothing. But in the end these are really just 3-5% differences.
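For the record, the entropy computation is just Shannon entropy applied to each row of the K x V topic-word matrix, with a sort to line up topics across the two runs (the matrices below are toy stand-ins, not the actual Cora fits):

```python
import numpy as np

def topic_entropies(beta: np.ndarray) -> np.ndarray:
    """Entropy (in nats) of each row of a K x V topic-word matrix."""
    p = beta / beta.sum(axis=1, keepdims=True)
    # 0 * log(0) is taken to be 0.
    logp = np.log(p, where=p > 0, out=np.zeros_like(p))
    return -(p * logp).sum(axis=1)

# Toy topic matrices standing in for the two fitted models.
beta_counts = np.array([[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]])
beta_binary = np.array([[0.5, 0.3, 0.2], [0.34, 0.33, 0.33]])

# Sort before pairing topics up across the two runs.
h_counts = np.sort(topic_entropies(beta_counts))
h_binary = np.sort(topic_entropies(beta_binary))
```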

Next up, let’s look at convergence of the likelihood bound (these aren’t strictly comparable in any meaningful way since of course the likelihood is computed over two different data representations). Here I’m showing the variational bound on the log likelihood as a function of iteration. In black, word counts are used as features; in red, strictly binary features are used. Because the likelihood scales are different, I’ve rescaled the two curves so that they’re approximately equal (the two axes reflect the two original scales). As with the other evaluations, the shape of the two curves is very similar.
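The rescaling here is nothing fancy; one way to overlay two curves that live on different likelihood scales is a simple min-max affine map (a sketch with made-up bound values, not the actual runs):

```python
import numpy as np

def minmax_rescale(y: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Affinely map y onto [lo, hi] so two curves on different
    scales can be overlaid and compared by shape."""
    return lo + (hi - lo) * (y - y.min()) / (y.max() - y.min())

# Toy variational bounds per iteration (illustrative numbers only).
bound_counts = np.array([-9.0e5, -8.5e5, -8.2e5, -8.1e5])
bound_binary = np.array([-4.0e5, -3.7e5, -3.6e5, -3.55e5])

# Map the binary-feature curve onto the word-count curve's range.
overlay = minmax_rescale(bound_binary, bound_counts.min(), bound_counts.max())
```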

Finally, let’s look at the entropy of the per-document topic proportions to get an idea of how topics are assigned in each case. As with the previous scatter plot, this plot shows the entropy of each document’s topic proportions under word counts vs. word presence. As before, less data in the binary-feature case generally means higher entropy. But the differences are more notable here: while for many documents the differences are within 10%, for some the difference is as large as 100%. This is most likely because in some cases (e.g., when the number of distinct words in a document is small), feature binarization changes the character and amount of data in the document by quite a lot.
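If you want to reproduce this from LDA-C output, normalize the rows of the variational Dirichlet parameters (the final.gamma file, if memory serves) to get each document's mean topic proportions, then take the entropy. A sketch with toy numbers:

```python
import numpy as np

def doc_topic_entropy(gamma: np.ndarray) -> np.ndarray:
    """Entropy of each document's mean topic proportions, where
    gamma is a D x K matrix of variational Dirichlet parameters."""
    theta = gamma / gamma.sum(axis=1, keepdims=True)
    return -(theta * np.log(theta)).sum(axis=1)

# A short document's gamma stays close to the (symmetric) prior,
# so its topic proportions are near-uniform and its entropy is high;
# a long document concentrates on fewer topics.
gamma = np.array([
    [1.2, 1.1, 1.3],   # short doc: near-uniform proportions
    [40.0, 2.0, 1.0],  # long doc: concentrated on one topic
])
h = doc_topic_entropy(gamma)
```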

I think it is in these corner cases that using the full word-count data is likely to be most useful. But overall the differences are not that great, and not worth expending too many grey cells over.


