Not too long ago John Langford stopped by and gave a fascinating talk. There were a lot of takeaways, but here's one that really got my noodle going: a lot of the time we get really high-dimensional data (say, text) and we want to use it to make some prediction. The high dimensionality makes the problem intractably difficult, in part because of the curse of dimensionality, and in part because high-dimensional feature spaces lead to high-dimensional parameter spaces, which are taxing both computationally and in terms of storage.
So what do we do? Dimensionality reduction or feature selection. A lot of effort has gone into doing these well (some popular approaches these days are LDA and L1 regularization). John came up with another way of doing dimensionality reduction: hash the features into a fixed number of buckets and simply ignore hash conflicts. At first blush this seems silly: you get a form of dimensionality reduction wherein different features are additively combined in some arbitrary way. But his results show that this actually works quite well. I'm not entirely sure why, but it got me thinking: are there simple ways of reducing feature spaces that make for very good !/$ propositions? I decided to find out by running some experiments at a scale and in a domain I'm a little more familiar with.
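To make that concrete, here's a minimal sketch of the "hash and ignore conflicts" idea (my own toy illustration, not John's actual implementation): each word is hashed into one of a fixed number of buckets, and whatever collides just gets added together.

```python
# Minimal sketch of the "ignore hash conflicts" idea: hash each word feature
# into one of n_buckets slots; colliding features are simply added together.
# This is a toy illustration, not John's actual implementation.
from collections import defaultdict

def hashed_features(words, n_buckets=1000):
    x = defaultdict(float)
    for w in words:
        x[hash(w) % n_buckets] += 1.0   # collisions pile up additively
    return x                            # sparse vector: bucket index -> value

print(hashed_features("the theory of learning the hard way".split(), n_buckets=8))
```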
Here's the experimental setup: I took the Cora data set (thanks to LINQS.UMD for making a munged version of this data available), a database of scientific papers, each with a subject categorization. To keep things simple, my goal will be to predict whether or not a paper is labeled "Theory." I'll make these predictions using binary word features (i.e., the presence or absence of a word in a paper's abstract). I'll take these features and put them through a logistic regression model with a little bit of an L2 penalty for good measure (n.b. I also tried an SVM; the results are about the same, but a little worse). My data has been cleaned up, but there are still around 1.5k features and only 2.7k data points. This is less than ideal because 1.) the dimensionality-to-data ratio is not too good, and 2.) storing that matrix is still pretty taxing for my poor laptop.
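Concretely, the pipeline is something like the sketch below (scikit-learn; the file name and column layout of the munged Cora data are assumptions on my part, so adjust for however your copy is stored):

```python
# Sketch of the setup: binary word features -> L2-penalized logistic regression,
# predicting whether a paper is labeled "Theory". The file name and column
# layout (id, word indicators..., label) are assumptions; adapt as needed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

raw = np.genfromtxt("cora.content", dtype=str)   # assumed layout: id, words..., label
X = raw[:, 1:-1].astype(float)                   # ~1.5k binary word features
y = (raw[:, -1] == "Theory").astype(int)         # 1 iff the paper is labeled "Theory"

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())   # rough sanity check
```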
What I want to do is reduce the number of features to some fixed number k (in the plots below I run the experiments for several values of k). I'm going to try five methods (a code sketch of all five appears after this list):
- random – Uniformly randomly select k features from the feature space.
- corr – Select the k features which have the highest individual correlation with the dependent variable.
- cream – Select the k most frequently occurring features (the cream of the crop; they rise to the top).
- crust – Like "cream," except skip the sqrt(D) most frequent features, where D is the dimensionality of the feature space. These are the "crust" features, i.e., those that are too frequent or not frequent enough to carry much information.
- merge – Randomly additively combine features until there are just k features left.
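Here's roughly what I mean by each of these, as a sketch. It assumes X is the n-by-D binary document-word matrix and y is the 0/1 "Theory" label vector; the function names and details are just for illustration.

```python
# Sketches of the five selectors. X is an (n, D) binary 0/1 matrix, y a 0/1
# label vector; each function returns a reduced matrix with k columns.
import numpy as np

def random_k(X, y, k, rng=np.random.default_rng(0)):
    return X[:, rng.choice(X.shape[1], size=k, replace=False)]

def corr_k(X, y, k):
    # absolute correlation of each feature with the label
    c = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    c = np.nan_to_num(c)                      # constant columns have undefined corr
    return X[:, np.argsort(-c)[:k]]

def cream_k(X, y, k):
    freq = X.sum(axis=0)
    return X[:, np.argsort(-freq)[:k]]        # the k most frequent features

def crust_k(X, y, k):
    freq = X.sum(axis=0)
    skip = int(np.sqrt(X.shape[1]))           # skip the sqrt(D) most frequent
    return X[:, np.argsort(-freq)[skip:skip + k]]

def merge_k(X, y, k, rng=np.random.default_rng(0)):
    # randomly assign every original feature to one of k groups, then add up groups
    groups = rng.integers(0, k, size=X.shape[1])
    return np.stack([X[:, groups == g].sum(axis=1) for g in range(k)], axis=1)
```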
Here’s the pretty dotchart (the red line is how well you’d do using random guessing) of the 0-1 error estimated using 10 bootstrap replications:
[Figure: Predictive performance using different feature selection mechanisms.]
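For reference, the 0-1 error behind the chart is just a bootstrap average; here's a minimal sketch of that kind of estimate (train on a resample, test on the out-of-bag rows; the exact resampling details here are a simplification):

```python
# Sketch: estimate 0-1 error with bootstrap replications by resampling rows with
# replacement, training on the resample, and testing on the held-out (OOB) rows.
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_error(X, y, n_reps=10, seed=0):
    rng = np.random.default_rng(seed)
    n, errs = len(y), []
    for _ in range(n_reps):
        idx = rng.integers(0, n, size=n)           # bootstrap sample of row indices
        oob = np.setdiff1d(np.arange(n), idx)      # out-of-bag rows serve as test set
        clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X[idx], y[idx])
        errs.append(np.mean(clf.predict(X[oob]) != y[oob]))
    return np.mean(errs)
```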
A few things that surprised me:
- You don’t really need that many features. Using 50 features doesn’t really do substantially worse than 200 if you use corr.
- Corr was just something I threw together on the fly; I really didn’t expect it to work so well and I haven’t actually used it in practice. I guess I should start.
- Cream does better than crust. Crust was motivated by the fact that when you put data into topic models like LDA, you get better looking topics when you remove overly-frequent words (since they tend to trickle to the top of every topic). I guess good looking topics don’t necessarily make for good predictions. Food for thought.
- Random does poorly as one might expect.
- Merge is just about as bad as random. I don't know how to reconcile this with what I said at the outset, except to note that 1.) merge is not necessarily combining features the exact same way a hash would, 2.) the features are different, 3.) the number of features is different, 4.) the amount of data is different, 5.) the data set is different, and 6.) the classifier is different.
Ok, so the data didn’t show me what I was hoping and half expecting to see. Chalk it up to different strokes for different domains. But I did find something that, while obvious in retrospect, will probably be useful going forward.
Interesting. The "corr" criterion is very similar to the work on independence screening by Fan and Lv. Also, look at Della Pietra^2 and Lafferty.
I would expect a signed merge (like the signed hash) to work epsilon better than the additive merge, because slightly more information is preserved.
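(For concreteness, a signed merge along these lines might look like the sketch below; the function name and details are an illustration, not John's implementation.)

```python
# Sketch of a signed merge: like the additive merge, but each original feature
# gets a random +/-1 sign before being added into its bucket, so collisions
# partially cancel instead of always piling up.
import numpy as np

def signed_merge(X, k, seed=0):
    rng = np.random.default_rng(seed)
    groups = rng.integers(0, k, size=X.shape[1])      # bucket for each feature
    signs = rng.choice([-1.0, 1.0], size=X.shape[1])  # random sign for each feature
    out = np.zeros((X.shape[0], k))
    for j in range(X.shape[1]):
        out[:, groups[j]] += signs[j] * X[:, j]
    return out
```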
Of the techniques listed, random and merge/hash are the fast low cost ones, and I’m not surprised to see the more computationally expensive ones perform better. A challenge is: find a way to achieve the predictive performance of corr with the computational footprint of merge/hash.
Interesting points, John.
It's hard for me to properly imagine all the bottlenecks that might come about when you try to scale all this stuff up, but 'corr' is rather convenient because all the correlations can be computed in one pass over the data, keeping track of a few simple sufficient stats.
So I’d imagine a 3-phase learning process wherein you
1.) go through the data, compute sufficient stats, and decide which features you want to keep;
2.) create a canonical ordering (assignment to indices) of the features you want to keep and munge the data to use this canonical ordering (and throw away the unused features);
3.) train the model on these munged feature vectors.
Do you have thoughts on how hard each of these will end up being? I suspect #2 will end up being the bottleneck since the input to #3 will be small and #1 can be easily parallelized.
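For step 1, the sufficient stats I have in mind are just a handful of counts; here's a minimal sketch for binary features and a binary label (the streaming interface is made up for illustration):

```python
# Sketch of step 1: a single pass over the data, accumulating only the counts
# needed to recover every feature's correlation with a binary label.
import numpy as np

def one_pass_corr(rows, D):
    """rows yields (list_of_feature_indices, 0/1 label) pairs; D = vocabulary size."""
    n, sum_y = 0, 0.0
    sum_x = np.zeros(D)      # how often each feature fires
    sum_xy = np.zeros(D)     # how often it fires on a positive example
    for feats, y in rows:
        n += 1
        sum_y += y
        sum_x[feats] += 1.0
        if y:
            sum_xy[feats] += 1.0
    # Pearson correlation from the counts (x^2 = x and y^2 = y for 0/1 values)
    num = n * sum_xy - sum_x * sum_y
    den = np.sqrt((n * sum_x - sum_x**2) * (n * sum_y - sum_y**2))
    return np.divide(num, den, out=np.zeros(D), where=den > 0)
```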
An efficient parallelization of step 1 might be harder than imagined. Imagine for example that you wanted to use all 5-grams as features. To compute statistics, you need to have distinct nodes for subsets of the features, implying quite a bit of communication if you store things in an example-based format.
Interesting experiments.
From a practical point of view, only corr really makes sense as it is the only one taking into account the actual output. Traditional methods like information gain should work at least as well, one would hope.
I don’t understand this talk about computational footprint. Measures like corr/IG only require one pass over the data (or even a sample), so surely for most learning algorithms the cost is marginal compared to the actual parameter estimation?
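A minimal sketch of information gain for binary features, computed from the same kind of one-pass counts (n, sum_y, per-feature sum_x and sum_xy) as the correlations above; the notation here is illustrative:

```python
# Sketch: information gain of each binary feature about a binary label, from
# one-pass counts: n examples, sum_y positives, and per-feature arrays
# sum_x (feature present) and sum_xy (present and positive).
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def info_gain(n, sum_y, sum_x, sum_xy):
    p_y = sum_y / n                                  # P(y = 1)
    p_x = sum_x / n                                  # P(feature present)
    p_y_given_x1 = sum_xy / np.maximum(sum_x, 1)     # P(y = 1 | present)
    p_y_given_x0 = (sum_y - sum_xy) / np.maximum(n - sum_x, 1)
    return entropy(p_y) - (p_x * entropy(p_y_given_x1)
                           + (1 - p_x) * entropy(p_y_given_x0))
```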
Btw for your step 1, in a typical filesystem, the file reading would probably be the bottleneck and impose limits on the level of parallelism.
I'm also wondering about the rationale for the sqrt(D) in the "crust" method. For text processing, D is likely to keep growing with the sample size due to Zipf's law, so there is a point at which you're probably starting to discard important frequent features.
Thanks for a thought-provoking note!
You’re right that step 1 is likely going to be a bottleneck; this is exactly what John was intimating in an earlier comment I think. When you have a very large feature space, you don’t have enough memory to compute the sufficient stats over all of them at once so you end up having to make lots of passes over the data one way or the other.
There's no rationale for the sqrt(D) thing; I admit that it's a total hack, but it seems OK at smaller scales (say, a vocabulary of ~2k items).