The other day I paid a nice visit to Alex and Yan. We got around to talking about how bit.ly (a link shortener) can be used to track things on twitter. Anyhow, I’m sure they will blow you away with their analysis soon enough, but I thought I’d post some results from a really simple analysis.

The cool thing about bit.ly is that there’s an API that allows us to find out how many clickthrus there were on each link. This makes basic website analytics available to everyone and gives us the ability to start looking at what drives traffic. So we can try to figure out what motivates people to click on links posted on twitter: content, network, or something else?

Here’s what I did: I took the last 3200 tweets by theonion and extracted all the bit.ly links therein (there were about 1200). I then got the number of clicks for each link, as well as relevant metadata, through the bit.ly API. There’s a tiny bit of noise there, but here’s what it looks like when I plot the number of clicks (as measured by bit.ly) versus the date when the link was tweeted:
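As an aside, the link-extraction step is just a regular-expression scan over the tweet text. A toy Python sketch (the tweets and helper name here are made up; my actual script was different):

```python
import re

# Made-up example tweets; the real data was theonion's last 3200 tweets.
tweets = [
    "Area Man Does Thing http://bit.ly/abc123",
    "No link in this one",
    "Two links: http://bit.ly/xyz789 and http://bit.ly/def456",
]

BITLY_RE = re.compile(r"https?://bit\.ly/\w+")

def extract_bitly_links(texts):
    """Return every bit.ly URL found across a list of tweet texts."""
    return [url for text in texts for url in BITLY_RE.findall(text)]

print(extract_bitly_links(tweets))
```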

You can see how phenomenally theonion’s twitter account has taken off in the last year, eventually reaching this weird cyclical pattern; we currently seem to be in one of its valleys. (I don’t really have a good explanation for why that pattern is occurring.) But what’s also phenomenal is how closely clicks tend to track the mean. That is, there isn’t a whole lot of variance at any given time. I’d guess that there is a set of regular readers who click on pretty much everything that theonion posts. And while there is an ebb and flow of regular readers, it’s not as if, within some time slice, there are a few articles which really take off (“go viral”) and a bunch which languish. This is totally strange to me; my intuition based on diggs is that there’d be a Pólya-urn, rich-get-richer type of distribution for link clickthrus, but there doesn’t appear to be.
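For contrast, here’s roughly what a rich-get-richer process would produce — a toy Pólya-urn simulation in Python (every parameter here is made up for illustration; this is not fitted to theonion’s data):

```python
import random

def polya_urn_clicks(n_links=100, n_clicks=10_000, alpha=1.0, seed=0):
    """Toy rich-get-richer simulation: each new click lands on link i with
    probability proportional to (clicks link i already has) + alpha."""
    rng = random.Random(seed)
    counts = [0] * n_links
    for _ in range(n_clicks):
        i = rng.choices(range(n_links), weights=[c + alpha for c in counts])[0]
        counts[i] += 1
    return counts

counts = polya_urn_clicks()
print(sorted(counts, reverse=True)[:5])  # a handful of links hog many clicks
```

Under this process the click counts come out heavy-tailed, with a few runaway winners — exactly the pattern the plot above doesn’t show.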
This is also strange to me because followers of this account are basically treating it like an RSS feed of onion articles, which makes me wonder: why are they using twitter at all?

I broke down the data a few other ways to see if I could tease out other trends. I tried breaking it down by time of day. As expected, posting stops at night and begins to pick up again around noon GMT (8 am Eastern). But there isn’t a huge amount of variation based on when the URLs get tweeted: once a link gets into people’s queue, it seems that they’ll get around to it eventually.

Finally, I tried breaking it down by day of the week. Not much news to report here. There are fewer tweets on Saturday and Sunday (although it sort of picked up on those days during July). And there isn’t any significant difference in terms of number of clickthrus per link on any given day of the week.

So there you have it. theonion has basically co-opted twitter as a news feed, and its readers faithfully read (or at least click on) the posted bit.ly links; any content or network effects seem to average out in the end.

Major thanks to Eytan for introducing me to the bit.ly API and lots of pro-tips on navigating/understanding the twitterverse.


## LDA for the masses (who use R)

Long time no post. I’ve been busy with lots of stuff: writing my thesis, renaming this blog to pleasescoopme.com, and other stuff which I’ll post soon enough. Another thing I’ve been working on is an R package that implements collapsed Gibbs samplers (written in C) for some of the models I’ve been using: latent Dirichlet allocation (LDA), the mixed-membership stochastic blockmodel (MMSB), and supervised LDA (sLDA). It’s still somewhat experimental but I’ve found it to be immensely useful already. Here are some included demos to show off what you can already do out of the box (plots made with the fantastic ggplot2 package):

You too can make all these pretty pictures by downloading the package here. Then simply run ‘R CMD INSTALL lda_1.0.tar.gz’ to install the package and you’ll be ready to go! All of you out there who work with these models, or want to start working with these models, give it a shot and gimme any feedback you have. I hope to improve things and add more models in some upcoming releases.


## Another simple parameter estimation method for LDA

In an earlier post, I looked at a couple of different ways to estimate the parameters of a latent Dirichlet allocation (LDA) model, namely the topic mixing proportions $\theta$ and the multinomial topic distributions $\beta$. Continuing along those lines, here I describe yet another way of doing so. This way turns out to be much faster, but at the cost of accuracy.

As I mentioned in the previous post, the objective we try to maximize is the log likelihood of the data, namely $\sum_d \sum_{w \in d} \log \theta_d^T\beta_{w}$. The trick I want to use here is a concave bound: since $\log$ is concave and $\theta_d$ sums to one, Jensen’s inequality lets me replace this objective with the lower bound $\sum_d \sum_{w \in d} \theta_d^T \log \beta_{w}.$ Solving this for $\beta_{w}$ turns out to be easy: $\beta_{w'} \propto \sum_d \sum_{w \in d} \mathbf{1}(w' = w) \theta_d + \eta,$ where $\eta$ is a Dirichlet hyperparameter on $\beta$.
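In code, the $\beta$ update is just a $\theta$-weighted count of word occurrences plus smoothing. A toy Python sketch (the helper name and data layout are mine; the post’s actual implementation was in R):

```python
import numpy as np

def update_beta(docs, theta, n_words, eta=0.1):
    """beta[w', k] proportional to sum over docs d and tokens w in d of
    1(w' = w) * theta[d, k], plus the smoothing hyperparameter eta.
    docs: list of token-id lists; theta: (n_docs, n_topics) array."""
    n_topics = theta.shape[1]
    beta = np.full((n_words, n_topics), eta)
    for d, doc in enumerate(docs):
        for w in doc:
            beta[w] += theta[d]
    return beta / beta.sum(axis=0)  # normalize each topic over the vocabulary

docs = [[0, 0, 1], [2, 1]]                      # two tiny documents of token ids
theta = np.array([[0.6, 0.4], [0.3, 0.7]])      # made-up mixing proportions
beta = update_beta(docs, theta, n_words=3)
```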

Now this bound isn’t as nice for solving for $\theta_d$: the solution will always be degenerate at a corner of the simplex. While this is not necessarily bad for modeling likelihood, it does mean that we won’t be able to get a handle on mixed-membership behavior. An alternative is to optimize a slightly different bound; playing the same trick as before, but with the roles of the variables swapped, yields $\sum_d \sum_{w \in d} \frac{\beta_{w}^T}{|\beta_w|_1} \log \theta_d + \log |\beta_w|_1.$ This has a solution of the form $\theta_d \propto \sum_{w \in d} \frac{\beta_{w}}{|\beta_w|_1} + \alpha,$ where $\alpha$ is a Dirichlet hyperparameter on $\theta_d$.

So in essence what I’m proposing here is alternating between optimizing $\beta$ and $\theta$, something which has a very EM flavor, except that each of the optimization steps really optimizes a different bound. Notice here that the “E-step” which optimizes $\theta$ is completely non-iterative — it’s just a simple sum-then-normalize — unlike variational inference for LDA, which requires iterative estimation of the variational parameters for each E-step. This means that this simple procedure is liable to be much faster. I didn’t do any formal speed tests, but for the results below, which were implemented in R, using this procedure to estimate the parameters of the model was noticeably faster than reading in the data! The same is decidedly NOT true for variational EM.
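Putting the two closed-form updates together, the whole procedure fits in a few lines. Here’s a toy Python sketch of the alternating scheme (my actual implementation was in R; the corpus, hyperparameter values, and function name here are invented):

```python
import numpy as np

def fast_lda(docs, n_topics, n_words, n_iters=20, alpha=0.1, eta=0.1, seed=0):
    """Alternate the two closed-form updates: beta from theta-weighted word
    counts, then theta from sums of row-normalized beta.
    docs: list of token-id lists."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.ones(n_topics), size=len(docs))
    for _ in range(n_iters):
        # "M-step": beta[w] proportional to sum of theta_d over occurrences of w, plus eta
        beta = np.full((n_words, n_topics), eta)
        for d, doc in enumerate(docs):
            for w in doc:
                beta[w] += theta[d]
        beta /= beta.sum(axis=0)                      # each topic sums to 1 over words
        # "E-step": theta_d proportional to sum over tokens of beta_w / |beta_w|_1, plus alpha
        row = beta / beta.sum(axis=1, keepdims=True)  # normalize each word over topics
        theta = np.array([row[doc].sum(axis=0) + alpha for doc in docs])
        theta /= theta.sum(axis=1, keepdims=True)
    return theta, beta

docs = [[0, 0, 1], [2, 2, 3], [0, 3]]  # three tiny documents of token ids
theta, beta = fast_lda(docs, n_topics=2, n_words=4)
```

Each pass over the corpus is a single weighted count followed by two normalizations, which is where the speed comes from.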

Ok, but how bad is the bound for getting us the true parameters? This bad. Here I’m showing you the barplots of the log KL distance between the estimated values of $\theta$ and the true values of $\theta$ used to generate this synthetic data set. Lower is better. “LDA” uses the mean estimates derived from the lda-c package. “Opti” uses the optimization procedure I described in the previous post (as mentioned there, “Opti” compares favorably with “LDA”). “Fast” uses the method I described here. “Fast” is rather worse than the other methods unfortunately. When looking at the results I noticed that “Fast” was considerably “smoother” than the other methods, i.e., “Fast” produces estimates with higher entropy. As an ad hoc fix, I also computed “Fast^2” which consists of the same estimates as “Fast”, but squared to yield sharper results. That one actually does better and is within shooting distance of the other methods, but it’s not quite there yet.
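For concreteness, “Fast^2” is just elementwise squaring followed by renormalization; a small Python sketch (the function name is mine):

```python
import numpy as np

def sharpen(theta):
    """The 'Fast^2' fix: square each estimated theta_d elementwise, then
    renormalize. High-entropy (smooth) estimates become more peaked."""
    sq = theta ** 2
    return sq / sq.sum(axis=1, keepdims=True)

theta = np.array([[0.5, 0.5],    # a uniform row stays uniform
                  [0.8, 0.2]])   # a lopsided row gets sharper
print(sharpen(theta))
```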

Anyways, there are a lot of scenarios where speed matters, and in those cases I suspect “Fast” will turn out to be accurate enough and fast enough to be of practical use. “Fast^2” does better in this case, but I have no real understanding of why that’s so. Maybe it’s a fluke, or maybe it corresponds to a bound that more accurately estimates the log likelihood. And perhaps understanding why it does better will help derive an even better way of estimating the parameters. Maybe one of you can enlighten me.


## How far apart are you and I?

As a result of a conversation I recently had, I was curious to know how far apart two people plucked at random from the population of the world/US would be. I used the data from the R maps package and figured it out.
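The computation boils down to drawing pairs of population-weighted locations and averaging their great-circle distances. Here’s a minimal Python sketch of the idea (haversine formula plus Monte Carlo sampling; the points and weights below are placeholders, not the maps-package data I actually used):

```python
import math
import random

EARTH_RADIUS_MILES = 3958.8

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in miles between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def mean_pairwise_distance(points, weights, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the expected distance between two people drawn
    independently, each located at points[i] with probability ~ weights[i]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        (lat1, lon1), (lat2, lon2) = rng.choices(points, weights=weights, k=2)
        total += great_circle_miles(lat1, lon1, lat2, lon2)
    return total / n_samples
```

With real city coordinates weighted by population, this converges quickly to the expected random-pair distance.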

Answer? Farther (or further?) than I would’ve thought. Assuming the readers of this blog are distributed like everyone else, then you and I are probably around 5000 miles apart.


## I shall be telling this with a sigh somewhere ages and ages hence

Warning: If you have a vested interest in what I do after I graduate (you know who you are; if you’re not sure if this applies to you, then it doesn’t) or you don’t have the stomach for the inherently impolitic nature of job-search aporia, you should probably stop reading.

As some of you know, I’ve been wrestling with some decisions over what I want to do once I graduate (in the near future, knock on wood). And I’ve had a really hard time coming to any sort of definitive conclusion. I know a lot of you also had/have the same decisions to make.

The first order decision is whether to go the route of industry or academia. I’m really interested in having the freedom and resources to work on interesting problems; I’ve fielded arguments from both sides on whether industry or academia is best able to realize this. On the one hand, industry generally has a lot of resources (be that data or computers). And some places are engaging with problems which I think are of real research interest. But in the end you have to sing for your supper, so your eye is almost always towards some product (vague though it may be).

Academia, on the other hand, is relatively unfettered. It is true, however, that you are subject to the whims of inconstant moons, grants, committees, etc. You’ve got to hustle, and sometimes that hustle may entail compromise on the research front. Academics also might have fewer resources on average, but they’re no slouches either. On balance, academia provides an environment in which the talismans of research, publications, are given priority. Whether these count as progress is up for debate, I suppose, but there’s no denying that there’s lots of interesting stuff in those pages.

The second order decisions are on where one should go in academia/industry. What should one look for in an academic position? In an industry position? A smart boss? A famous boss? Smart peers? Smart underlings? Other interactions with academia/industry? A secure role? A flexible role? Other? All of the above? None of the above?

Finally, while it may seem from the above discussion that options can only cause headache, I still think options are good. So what track provides the most opportunities for change in the future? I’ve heard it argued that one should just go to academia; you can always leave. On the other hand, it’s also been argued that silicon valley changes much more quickly than the lumbering giant that is academia; you should take those opportunities when they appear. If I go to academia will I wall myself off in an ivory tower? If I go to industry, will academia then shun me?

So, do you have advice/musings for me? Feel free to comment below, or email me if it’s personal. And contact me if you want to take a more specific survey about these questions =).

## Update:

I did an informal poll of some people I know. They came out on the side of going into academia. Out of curiosity, I broke down the results according to whether the survey respondent was currently in academia or industry (there’s some fuzziness in how I categorized people: e.g., grad students were counted as academia). The result of this is here. As you can see, if I were to only query industry people, then industry would win out on my poll. People currently in academia on the other hand are overwhelmingly tilted towards academia. Go figure.