Dave gently reminded me that properly assessing convergence of our models is important and that just running a sampler for N iterations is unsatisfactory. I agree wholeheartedly. As a first step, the collapsed Gibbs sampler in the R LDA package can now optionally report the log likelihood (to within a constant). For example, we can rerun the model fit in demo(lda), but with an extra flag set:
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,    ## Num clusters
                                      cora.vocab,
                                      25,   ## Num iterations
                                      0.1,
                                      0.1,
                                      compute.log.likelihood = TRUE)
Using the now-available variable result$log.likelihoods, we can plot the progress of the sampler versus iteration:
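A minimal plotting sketch (assuming result$log.likelihoods comes back either as a numeric vector with one entry per iteration or as a matrix with iterations along the columns):

ll <- result$log.likelihoods
if (is.matrix(ll)) ll <- ll[1, ]   ## take the first tracked series if it's a matrix
plot(seq_along(ll), ll, type = "l",
     xlab = "Gibbs iteration", ylab = "Log likelihood (up to a constant)")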
Grab it while it’s hot: http://www.cs.princeton.edu/~jcone/lda_1.0.1.tar.gz.
I agree with Dave! I finally took Martin Jansche’s advice and refactored LingPipe’s Java implementation of collapsed Gibbs for LDA to return an iterator over Gibbs samples. (The original implementation, which is still there, took a callback function to apply to each sample.)
One thing the sample lets you do is compute the corpus log likelihood given the parameters. Note that this is just the likelihood, not the prior. Did you include the Dirichlet prior or just p(docs|topics)? And did you sum over all possible topic assignments, or just evaluate the probability of the actual one?
I do the same thing, only with the coefficient prior as well, for the SGD implementation of logistic regression.
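A rough R sketch (not LingPipe's actual Java, and assuming a Gaussian coefficient prior) of the kind of objective being tracked there, i.e. the data log likelihood plus the log prior, up to a constant:

## beta: coefficient vector, X: design matrix, y: 0/1 responses, sigma2: prior variance
log_posterior <- function(beta, X, y, sigma2) {
  eta <- as.vector(X %*% beta)
  loglik   <- sum(y * eta - log1p(exp(eta)))   ## Bernoulli (logistic) log likelihood
  logprior <- -sum(beta^2) / (2 * sigma2)      ## Gaussian log prior, up to a constant
  loglik + logprior
}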
Hi,
I also wanted to do an iterator approach but laziness overtook me =).
As for what I compute, it's the joint likelihood given a sample of topic assignments.
Could you please explain a little more in regards to how you calculate the log-likelihood? I’ll tell you what I understand:
To make things clear, let’s say we have: (K – # of topics, M – # of documents, V – # of unique terms)
ndsum[j] = total # of words in document j, size M
nwsum[i] = total # of words assigned to topic i, size K
and
nw[i][j] = # of times word i is assigned to topic j
nd[i][j] = # of words in doc i assigned to topic j
Now, for each document m, you sum all the log-gamma(nd + alpha) terms and subtract the ndsum term. Then, for each topic k, you sum all the log-gamma(nw + beta) terms and subtract the nwsum term.
Is this the technique that you use for log-likelihood evaluation? Does this correspond to evaluating log(P(z = j | all other z, w, d, alpha, beta))? I’m referring to this paper: http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf
Best regards,
Adnan
You are more or less correct – that is how I’m computing log likelihood (see line 832 of gibbs.c in the package for details). I should point out that it is not the conditional probability of a single assignment but rather the joint probability over a set of assignments: log p(z_1, z_2, z_3, … | w, alpha, beta). It gets computed after an entire Gibbs sweep over the variables.
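For reference, here is a self-contained R sketch of the standard collapsed-LDA joint log likelihood, log p(w, z | alpha, beta), written in terms of the count arrays defined above (this follows the usual Griffiths-and-Steyvers formula rather than being copied from gibbs.c, so additive constants may differ):

## nw: V x K matrix, nw[i, j] = # of times word i is assigned to topic j
## nd: M x K matrix, nd[i, j] = # of words in doc i assigned to topic j
## nwsum: length-K vector of words per topic; ndsum: length-M vector of words per doc
lda_log_likelihood <- function(nw, nd, nwsum, ndsum, alpha, beta) {
  V <- nrow(nw); K <- ncol(nw); M <- nrow(nd)
  ## log p(w | z, beta): one group of terms per topic
  topic_part <- K * (lgamma(V * beta) - V * lgamma(beta)) +
    sum(lgamma(nw + beta)) - sum(lgamma(nwsum + V * beta))
  ## log p(z | alpha): one group of terms per document
  doc_part <- M * (lgamma(K * alpha) - K * lgamma(alpha)) +
    sum(lgamma(nd + alpha)) - sum(lgamma(ndsum + K * alpha))
  topic_part + doc_part
}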
How do I plot the log-likelihood vs. iteration in the R LDA package?
I’d recommend asking on the topic-models mailing list as there are many practitioners there who can help you.