# A generative model of binary data

We are often faced with data sets whose elements are vectors of binary data. For example, a tag corpus has for each document a binary vector the length of the tag vocabulary whose elements indicate tag presence/absence. A corpus of links has for each node a binary vector indicating the adjacency to other nodes (i.e., that node’s column of the adjacency matrix).

While it is possible to shoehorn this sort of data into LDA (as many people have done), why not construct a mixture model that expressly captures the binary nature of the data? Consider the following generative model:

1. For each topic $k$ and vocabulary item $v$, draw topic probabilities $\beta_{k,v} \sim \mathrm{Beta}(\xi)$.
2. For each document $d$:
    1. Draw document topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$.
    2. For each vocabulary item $v$:
        1. Draw topic assignment $z_{d,v} \sim \mathrm{Mult}(\theta_d)$.
        2. Draw binary response $x_{d,v} \sim \mathrm{Bernoulli}(\beta_{z_{d,v},v})$.
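To make the generative process concrete, here is a minimal NumPy sketch that samples a small synthetic corpus from it. It is an illustration, not the implementation used for the experiments. The function name `sample_corpus` and its arguments are hypothetical, $\mathrm{Beta}(\xi)$ is assumed to mean a symmetric $\mathrm{Beta}(\xi, \xi)$, and $\alpha$ is assumed to be a symmetric scalar.

```python
import numpy as np

def sample_corpus(D, V, K, alpha, xi, seed=0):
    """Sample D binary documents over V vocabulary items with K topics.

    Assumptions (not stated in the text): Beta(xi) is read as the
    symmetric Beta(xi, xi), and alpha is a symmetric scalar.
    """
    rng = np.random.default_rng(seed)

    # Step 1: one Bernoulli probability per (topic, vocabulary item).
    beta = rng.beta(xi, xi, size=(K, V))

    # Step 2.1: topic proportions for each document, shape (D, K).
    theta = rng.dirichlet(np.full(K, alpha), size=D)

    # Step 2.2.1: a topic assignment for EVERY vocabulary item,
    # whether or not the item ends up being present.
    z = np.empty((D, V), dtype=np.int64)
    for d in range(D):
        z[d] = rng.choice(K, size=V, p=theta[d])

    # Step 2.2.2: x[d, v] ~ Bernoulli(beta[z[d, v], v]).
    p = beta[z, np.arange(V)]  # broadcasts to shape (D, V)
    x = (rng.random((D, V)) < p).astype(np.int8)
    return beta, theta, z, x

beta, theta, z, x = sample_corpus(D=5, V=20, K=3, alpha=0.5, xi=1.0)
```

Note that the sampler draws $D \times V$ topic assignments, one per document and vocabulary item, regardless of how many entries of `x` turn out to be ones.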

The parameters of the model are $K$, $\alpha$, and $\xi$, and we observe each document’s binary vector $x_d$, with entries $x_{d,v}$.
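For reference, the joint distribution implied by the generative process can be written out as follows. This is a direct transcription of the steps, reading the Bernoulli parameter for $x_{d,v}$ as the entry $\beta_{z_{d,v},v}$ and writing $\theta_{d,k}$ for the $k$-th component of $\theta_d$:

$$
p(x, z, \theta, \beta \mid \alpha, \xi) = \prod_{k=1}^{K} \prod_{v} \mathrm{Beta}(\beta_{k,v} \mid \xi) \; \prod_{d} \mathrm{Dir}(\theta_d \mid \alpha) \prod_{v} \theta_{d,z_{d,v}} \, \beta_{z_{d,v},v}^{x_{d,v}} \left(1 - \beta_{z_{d,v},v}\right)^{1 - x_{d,v}}
$$

The inner product runs over every vocabulary item $v$, so each absent item ($x_{d,v} = 0$) contributes a factor $1 - \beta_{z_{d,v},v}$ just as each present item contributes $\beta_{z_{d,v},v}$.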

I implemented this using variational inference and found that it didn’t work very well. At first I thought it was an initialization or hyperparameter issue. In LDA we typically initialize $\beta$ using a slightly perturbed uniform matrix. I implemented the analogues of all the usual suspects and none of them did very well (on a tag data set and a link data set).

One problem is that the algorithm is extremely slow; LDA takes advantage of the sparsity of the data, while this model has a latent assignment $z_{d,v}$ for every vocabulary item, present or absent. This is related to the main problem: $\theta_d$ is almost always uniform. The reason is that the vast majority of the probability mass is used to explain the observations that are not there, the zeroes. If one looks at the zeroes, all the documents, in fact, look 99% the same. Since there are no major differences between them, they end up with very similar topic proportions and consequently all the parameters become uniform.
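A toy calculation illustrates why the zeroes dominate. The sizes below (a vocabulary of 10,000 items and 50 tags per document) are hypothetical and not taken from the tag or link data sets; the helper names `agreement` and `jaccard` are likewise just for illustration.

```python
import numpy as np

def agreement(x_a, x_b):
    """Fraction of vocabulary positions where two binary vectors match."""
    return float(np.mean(x_a == x_b))

def jaccard(x_a, x_b):
    """Overlap of the ones only: |A and B| / |A or B|."""
    both = np.sum((x_a == 1) & (x_b == 1))
    either = np.sum((x_a == 1) | (x_b == 1))
    return float(both / either)

rng = np.random.default_rng(0)
V, n_tags = 10_000, 50  # hypothetical vocabulary size and tags per document

# Two documents with independently chosen tag sets.
doc_a = np.zeros(V, dtype=np.int8)
doc_b = np.zeros(V, dtype=np.int8)
doc_a[rng.choice(V, size=n_tags, replace=False)] = 1
doc_b[rng.choice(V, size=n_tags, replace=False)] = 1

print("agreement over all positions:", agreement(doc_a, doc_b))
print("overlap of the ones only:   ", jaccard(doc_a, doc_b))
```

With these sizes the two vectors can disagree on at most 100 of the 10,000 positions, so the agreement is at least 0.99 by construction, even though the tag sets themselves were chosen independently. A likelihood that scores every position therefore sees the documents as nearly identical.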

Clearly, more work needs to be done.  One problem is that there’s no coupling in $\beta$.  Every element is generated independently.   What might we do to improve this…?