Dave gently reminded me that properly assessing convergence of our models is important and that just running a sampler for N iterations is unsatisfactory. I agree wholeheartedly. As a first step, the collapsed Gibbs sampler in the R LDA package can now optionally report the log likelihood (to within a constant). For example, we can rerun the model fit in demo(lda) but with an extra flag set:
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,  ## Num clusters
                                      cora.vocab,
                                      25,  ## Num iterations
                                      0.1,
                                      0.1,
                                      compute.log.likelihood=TRUE)
Using the now-available variable result$log.likelihoods, we can plot the progress of the sampler versus iteration:
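Something along these lines should draw such a plot. Treat it as a rough sketch rather than the code behind the original figure: plot_log_likelihoods is a made-up helper, not part of the package, and it assumes the trace is either a plain numeric vector with one value per iteration or a matrix with one column per iteration. Check the package documentation for the exact shape of result$log.likelihoods in your version.

```r
## Hypothetical helper, not part of the lda package.
## Plots a log-likelihood trace against iteration number.
## Assumption: `log.likelihoods` is either a numeric vector (one value per
## iteration) or a numeric matrix with one column per iteration.
plot_log_likelihoods <- function(log.likelihoods) {
  ll <- log.likelihoods
  if (!is.matrix(ll)) {
    ll <- matrix(ll, nrow = 1)
  }
  iterations <- seq_len(ncol(ll))
  ## matplot() draws one line per column of its second argument, so
  ## transpose to get one line per row of the trace matrix.
  matplot(iterations, t(ll), type = "l", lty = 1,
          xlab = "Iteration",
          ylab = "Log likelihood (up to a constant)")
  invisible(ll)
}

## Only runs if a fit named `result` with a log-likelihood trace exists,
## e.g. after the lda.collapsed.gibbs.sampler call shown earlier.
if (exists("result") && !is.null(result$log.likelihoods)) {
  plot_log_likelihoods(result$log.likelihoods)
}
```

A curve that is still climbing steeply at the final iteration suggests the sampler needs more iterations. A flat curve is encouraging, though it does not prove convergence.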
Grab it while it’s hot: http://www.cs.princeton.edu/~jcone/lda_1.0.1.tar.gz.