Monthly Archives: March 2009

Some simple attempts at feature selection

Not too long ago John Langford stopped on by and gave a fascinating talk. There were a lot of take-aways from the talk but here’s one that really got my noodle going: A lot of times we get really high-dimensional data (say text) and we want to use it to make some prediction. The high-dimensionality makes the problem intractably difficult, in part because of the curse, and in part because high-dimensional feature spaces lead to high-dimensional parameter spaces which is computationally (and space) taxing.

So what do we do? Dimensionality reduction or feature selection. A lot of effort has gone into doing these well (some popular ones these days are LDA and L1 regularization). JL came up with another way of doing dimensionality reduction: he ignores hash conflicts. At first blush this seems silly: you’ll get a form of dimensionality reduction wherein different features will be additively combined in some arbitrary way. But his results show that this actually works quite well. I’m not entirely sure why, but that got me thinking: are there simple ways of reducing feature spaces that make for very good !/$ propositions? I decided to find out by running some experiments on a scale/domain I’m a little more familiar with.

Here’s the experimental setup: I took the Cora data set (thanks to LINQS.UMD for making a munged version of this data available) which is a database of scientific papers each one with a subject categorization. In order to keep things simple, my goal will be to predict whether or not a paper is labeled as “Theory.” I’ll try to make these predictions using binary word features (i.e., the presence or absence of a word in a paper’s abstract). I’ll take these features and put them through a logistic regression model with a little bit of an L2 penalty for good measure (n.b. I also tried an SVM; the results are about the same but a little worse). My data’s been cleaned up but there are still around 1.5k features and only 2.7k data points. This is less than ideal because 1.) the dimensionality to data ratio is not too good, and 2.) storing that matrix is still pretty taxing for my poor laptop.

What I want to do is reduce the number of features to a fixed size S (in the plots below, I do the experiments using S \in \{50, 100, 200\}). I’m going to try five methods:

  • random – Uniformly randomly select S features from the feature space.
  • corr – Select the S features which have the highest individual correlation with the dependent variable.
  • cream – Select the S most frequently occurring features (the cream of the crop; they rise to the top).
  • crust – Like “cream,” except skip the first \sqrt{D} features, where D is the dimensionality of the feature space. These are the”crust” features, i.e., those that are too frequent or not frequent enough to carry much information.
  • merge – Randomly additively combine features until there are just S features left.

Here’s the pretty dotchart (the red line is how well you’d do using random guessing) of the 0-1 error estimated using 10 bootstrap replications:

Predictive performance using different feature selection mechanisms.

A few things that surprised me:

  1. You don’t really need that many features. Using 50 features doesn’t really do substantially worse than 200 if you use corr.
  2. Corr was just something I threw together on the fly; I really didn’t expect it to work so well and I haven’t actually used it in practice. I guess I should start.
  3. Cream does better than crust. Crust was motivated by the fact that when you put data into topic models like LDA, you get better looking topics when you remove overly-frequent words (since they tend to trickle to the top of every topic). I guess good looking topics don’t necessarily make for good predictions. Food for thought.
  4. Random does poorly as one might expect.
  5. Merge is just about as bad as random. I don’t know how to reconcile this with what I said at the outset except to that 1.) merge is not necessarily combining features the exact same way a hash would 2.) different features 3.) different number of features 4.) different amounts of data 5.) different data sets 6.) a different classifier.

Ok, so the data didn’t show me what I was hoping and half expecting to see. Chalk it up to different strokes for different domains. But I did find something that, while obvious in retrospect, will probably be useful going forward.



Filed under Uncategorized

Optimization instead of inference

You know, I’ve always taken it for granted that what we want to do is probabilistic inference but lately I’ve been thinking more about what we really want and how to get there.

To illustrate my point, consider our dear friend LDA. For the uninitiated, let me write down the generative process for this Bayesian mixture model of discrete data:

  1. For k \in 1\ldots K, draw \beta_k \sim \mathrm{Dir}(\eta);
  2. For d \in 1\ldots D:
    1. Draw \theta_d \sim \mathrm{Dir}(\alpha);
    2. For n \in 1\ldots N_d:
      1. Draw z_{d,n} \sim \mathrm{Mult}(\theta_d);
      2. Draw w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}}).

We only observe w_{d,n}; the rest of the variables are hidden from us. Now the story usually says that what we need to do is perform posterior inference, that is, determine the distribution over the hidden variables (z, \theta and sometimes \beta) given the observations. People have come up with a bunch of ways of doing this like MCMC and variational inference.

But in fact what we often care about is the mode. Just look at the way practitioners actually use these things. A typical MCMC experiment will run the updates for a hundred iterations or so, look at the final state of the parameters (under the assumption that we’ve found the mode) and then put a pretty picture of this final state in the paper.

If that’s what we actually want, why bother with the rest? Put another way, our goal is to maximize the joint likelihood of the data and the parameters we care about. One parameter we typically care about is \theta_d; however we typically don’t care about is z_{d,n}.

With these facts guiding us, we can write out the optimization problem for a particular document as \max_{\theta_d} p(\theta_d, w_d | \beta, \alpha) = \max_{\theta_d} p(\theta_d | \alpha) + \sum_n \log p(w_{d,n} | \theta_d, \beta). We can expand the probability for a single word p(w_{d,n} | \theta_d, \beta) = \sum_{z_{d,n}} p(w_{d,n} | z_{d,n}, \beta) p(z_{d,n} | \theta_d) = \theta^T \beta_{\cdot, w_{d,n}}. The portions of the prior term which are relevant are \log p(\theta_d | \alpha) = (\alpha - 1) \log \theta_d.

The objective function, in total then, is (\alpha - 1) \log \theta_d + \sum_n \log \theta^T \beta_{\cdot, w_{d,n}}. Now the solution is perhaps not entirely trivial but we can just plug this into a standard optimizer with adequate constraints to solve for the optimal \theta_d.

So how well does it do? To test this I generated a synthetic data set using what I felt were fairly typical hyperparameters: K=5, \eta = 0.1, \alpha=1 / K, D=1000 and I set the vocabulary size (i.e., the length of each \beta_k to 2000. Using these hyperparameters I followed the generative process to generate D documents. In order to decide how many words should appear in each document, I drew N_d \sim  \mathrm{Poisson}(100).

I then used the procedure above to estimate optimal values of \theta_d, which I’ll denote \hat{\theta}_{\mathrm{Opt}}. I also plugged this data set into LDA-C which performs variational inference. LDA-C estimates the posterior distribution; from its estimates I compute the mode mean using \hat{\theta}_{\mathrm{Var}} = \frac{\gamma_d}{\sum_k \gamma_{d,k}}. Let me emphasize that both methods are given the true value of the other parameters: \beta, \alpha, K.

With these estimates of \theta_d I compute their quality against the true value of \theta_d using KL-divergence. I’ve plotted the difference of the quality of \hat{\theta}_{\mathrm{Opt}} and \hat{\theta}_{\mathrm{Var}} for each document as a function of the number of words in that document. The plot is here. The points below the zero-line (colored in red) are points where the optimization described earlier actually does a better job of recovering the true \theta_d than LDA-C. The purple line (which is kind of hard to see since it’s near zero) represents the mean difference. Overall, LDA-C does slightly worse than we do.

So what we’ve got here is a fairly fast and easy way of finding some kind of “mode” for LDA. Although the modes found by the two techniques aren’t really maximizing the same thing, they both tend to do equally well at recovering the true value of the parameters. So why not just optimize?

P.S. If anybody is actually reading this, and has tried doing optimization with constraints of this sort, we should trade war stories.


Filed under Uncategorized

Randomness makes parallelization interesting

A friend of mine recently posed a problem which seems at first blush quite simple but turns out to be quite interesting. Suppose you have a program which takes T time and which is perfectly parallelizable — that is, I can split the program up into K tasks and the latency for the program will be T / K.
So I can make the latency arbitrarily small by taking K \rightarrow \infty.

So why wouldn’t you ever just set K as large as conceivably possible? Well, suppose there were a fixed startup cost associated with each task, S. Then, the latency would be S + T/K. Again, it’s clear that you should just parallelize the hell out of your program and eventually your latency will approach S.

But what if the startup cost were random? Let the startup cost be drawn from some probability distribution S_i \sim f. Now each subtask will take a different amount of time; since I’m concerned with the latency, the metric I want to optimize is \max_{i\in \{1\ldots K\}} S_i + T/K over K.

Now there’s a cool tradeoff involved in parallelization: one the one hand, splitting a task reduces the T/K part of the equation, but on the other, it increases the number of draws of S_i and therefore the expected maximum value.

It turns out that the solution to this problem leads down a huge rabbit hole involving order statistics, sample maxima, and extreme value statistics. Here are some highlights:

  • The probability distribution of the maximum value in a sample of K draws from a distribution f is given by K F(x)^{K-1} f(x) where F is the cumulative distribution function (There’s a more general form for the ith order statistic).
  • The expected value of the maximum (drawn according to the procedure above) is \int_{0}^1 F^{-1}(x^{1/K}) dx, where F^{-1} is the inverse cdf (or quantile) function. For almost all distributions this expression cannot be computed analytically (The uniform distribution being the notable exception).
  • When f = \mathrm{Unif}(a, b), the analytical solution to the expected maximum value is a + (b - a) \frac{K}{K+1}. This leads to a quadratic polynomial as the solution of the original problem. Without loss of generality, let b-a = 1. Then the optimal value of K takes the form \frac{T + \sqrt{T}}{1-T}. Note that when T > 1 the roots are negative, i.e., one should set K to infinity. The interesting range is between 0 and 1. I’ve attached a plot of this function for this range here. You’ll notice that the bend is quiet steep so that you’ll want the parallelization to either be very small or very large depending on how long your program takes relative to the startup costs.
  • When f is not uniform, things are more difficult. I’m not entirely sure what the right model is for the costs, but I think and exponential distribution might make a first cut. Using a numerical approximation, I computed latencies as a function of K for various values of T and rate parameter set to 1 here (the values of T are in the title of each plot). Because of the long tails, one needs to be much more cautious before parallelizing. For values of T explored here you generally shouldn’t to go above K = 5. Also, the different performance for different values of K can be quite substantial, but bang/buck is questionable. For example, when T=5 if you spend 5 times as many resources, you’ll go almost twice as fast. Is that a worthwhile investment? When T=1 you almost definitely don’t want to parallelize at all.

Ok, so it seemed like a simple little problem at first, but it turns out that something as straightforward sounding as parallelization can lead to some pretty interesting analysis.


Filed under Uncategorized

Another perspective on link probability functions

When deciding when a link between two documents should exist, we need to define a function of the covariates which we call a link probability function.  We have looked at many candidate functions but concentrated on two such functions:

  • \psi_\sigma = \sigma(\eta^T z_1 \circ z_2 + \nu)
  • \psi_e = \exp(\eta^T z_1 \circ z_2 + \nu)

Recently, I saw a new connections between the two functions.  This connection exists in the context of variational inference.  One of the terms we must compute in variational inference is \mathbb{E}_q[\log \psi(z_1, z_2)] where the expectation is taken over z_1, z_2 and \mathbb{E}_q[z_i] = \phi_i.   If we expand the equation using \psi_\sigma, this amounts to computing \eta^T \phi_1 \circ \phi_2 + \nu - \mathbb{E}_q[\log (1 + \exp(\eta^T z_1 \circ z_2 + \nu))].  This latter term is intractable to compute exactly, so we turn to approximation.

In previous posts, I primarily explored using the approximation \log(1 + \exp(\eta^T \phi_1 \circ \phi_2 + \nu)).   This is equivalent to approximating the function inside the expectation with a first order Taylor approximation centered around the mean.  Because of convexity, this approximation is a tangent line which forms a lower bound on the function (see graph below).   Unfortunately, this is the wrong bound for variational inference; there the quantity we are computing is supposed to be a lower bound which means that the bound on the partition function of the link probability function needs to be an upper bound.   That was a mouthful, and things seem grim but as it turns out this approximation is rather good since most of the points lie near the mean.

But why not construct an upper bound?  This is possible because the function is convex and because the covariates’ range is bounded.  The hyperplane which intersects the function at the covariates’ bounds is an upper bound for the function.  If we apply this approximation, the expectation expands into (\phi_1 \circ \phi_2)^T (\log(1 + \exp(\eta +\ nu)) - \log(1 + \exp(\nu))) +\log(1 + \exp(\nu)).   This upper bound is also depicted in the figure below.  But also note that if we set \nu' = \log(1 + \exp(\nu)) and \eta' = \log(1 + \exp(\eta + \nu)) - \log(1 + \exp(\nu)) then this expression becomes \eta'^T (\phi_1 \circ \phi_2) + \nu' which has the exact same functional form as \mathbb{E}_q[\psi_e(z_1, z_2)].  Thus, using this upper bound reduces to using \psi_e!

Two bounds on the link probability function

Thus, rather than considering two link probability functions, it is perhaps better to think of two approximations to a single link probability function.  This way of thinking also sheds some light on why perhaps \psi_e tends to perform better than \psi_\sigma.  Even though the lower-bound approximation is “better” in the sense that it is closer to the true value of the expectation, the bound is in the wrong direction for variational inference.  Maybe “cowboy” variational doesn’t work all that well after all.

Leave a comment

Filed under Uncategorized

A generative model of binary data

We are often faced with data sets whose elements are vectors of binary data. For example, a tag corpus has for each document a binary vector the length of the tag vocabulary whose elements indicate tag presence/absence. A corpus of links has for each node a binary vector indicating the adjacency to other nodes (i.e., that node’s column of the adjacency matrix).

While it is possible to shoehorn this sort of data into LDA (as many people have done), why not construct a mixture model which expressly captures the binary nature of the data. So consider the following generative model:

  1. For each topic k and vocabulary item v, draw topic probabilities \beta_{k,v} \sim \mathrm{Beta}(\xi).
  2. For each document d
    1. Draw document topic proportions \theta_d \sim \mathrm{Dir}(\alpha).
    2. For each vocabulary item v
      1. Draw topic assignment z_{d,v} \sim \mathrm{Mult}(\theta_d).
      2. Draw binary response x_{d,v} \sim \mathrm{Bernoulli}(\beta_{z_{d,v}}).

The parameters of the model are K, \alpha, \xi and we observe each document’s binary vector x_{d,v}.

I implemented this using variational inference and found that it didn’t work very well.  At first I thought it was some initialization or hyperparameter issue.  In LDA we typically we initialize the \beta using a slightly perturbed uniform matrix.  I implemented the analogues of all the usual suspects and none of them did very well (on a tag data set and a link data set).

One problem is that the algorithm is extremely slow; LDA takes advantage of the sparsity of the data while this model does not.  This is related to the main problem: \theta_d is almost always uniform.  The reason is that vast majority of the probability mass is used to explain the observations that are not there, the zeroes.  If one looks at the zeroes, all the documents, in fact, look 99% the same.   Since there are no major differences between them, they end up with very similar topic proportions and consequently all the parameters become uniform.

Clearly, more work needs to be done.  One problem is that there’s no coupling in \beta.  Every element is generated independently.   What might we do to improve this…?

Leave a comment

Filed under Uncategorized