Not too long ago John Langford stopped by and gave a fascinating talk. There were a lot of take-aways, but here's one that really got my noodle going: a lot of the time we get really high-dimensional data (say, text) and want to use it to make some prediction. The high dimensionality makes the problem intractably difficult, partly because of the curse of dimensionality and partly because high-dimensional feature spaces lead to high-dimensional parameter spaces, which are taxing both computationally and in terms of storage.
So what do we do? Dimensionality reduction or feature selection. A lot of effort has gone into doing these well (popular choices these days include LDA and L1 regularization). JL came up with another way of doing dimensionality reduction: hash the features down into a smaller space and simply ignore hash collisions. At first blush this seems silly: you get a form of dimensionality reduction in which different features are additively combined in some arbitrary way. But his results show that this actually works quite well. I'm not entirely sure why, but it got me thinking: are there simple ways of reducing feature spaces that make for very good !/$ (bang-for-the-buck) propositions? I decided to find out by running some experiments at a scale, and in a domain, I'm a little more familiar with.
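To make the hashing idea concrete, here's a minimal sketch (Python; the bucket count and the toy feature names are arbitrary illustrations of mine, not anything from the talk): each feature name is hashed into one of a small number of buckets, and features that collide simply add together.

```python
# Minimal sketch of the hashing trick: map a high-dimensional sparse
# feature vector into n_buckets dimensions, letting collisions add up.
import hashlib

def hash_features(features, n_buckets=256):
    """features: dict of feature name -> value (e.g., binary word presence)."""
    reduced = [0.0] * n_buckets
    for name, value in features.items():
        # Stable hash of the feature name, taken modulo the bucket count.
        bucket = int(hashlib.md5(name.encode("utf8")).hexdigest(), 16) % n_buckets
        reduced[bucket] += value  # colliding features get additively combined
    return reduced

# A document's word-presence features end up as a 256-dimensional vector.
doc = {"bayesian": 1, "network": 1, "inference": 1}
x = hash_features(doc)
```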
Here’s the experimental setup: I took the Cora data set (thanks to LINQS.UMD for making a munged version of this data available), which is a database of scientific papers, each with a subject categorization. To keep things simple, my goal will be to predict whether or not a paper is labeled “Theory.” I’ll make these predictions using binary word features (i.e., the presence or absence of a word in a paper’s abstract). I’ll take these features and put them through a logistic regression model with a little bit of an L2 penalty for good measure (n.b. I also tried an SVM; the results were about the same, just a little worse). My data has been cleaned up, but there are still around 1.5k features and only 2.7k data points. This is less than ideal because 1.) the dimensionality-to-data ratio is not too good, and 2.) storing that matrix is still pretty taxing for my poor laptop.
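In scikit-learn terms the model is roughly the following sketch (the randomly generated X and y are just stand-ins for the real word-presence matrix and “Theory” labels, and the penalty strength C is an arbitrary placeholder):

```python
# Sketch of the classifier: logistic regression with a small L2 penalty
# on binary word-presence features. X and y below are fake stand-ins
# with roughly the right shape (~2.7k papers, ~1.5k binary word columns).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = (rng.random((2700, 1500)) < 0.05).astype(float)  # fake binary word features
y = rng.integers(0, 2, size=2700)                    # fake Theory / not-Theory labels

# In scikit-learn, C is the inverse regularization strength, so a
# "little bit of an L2 penalty" means a fairly large C.
model = LogisticRegression(penalty="l2", C=10.0, max_iter=1000)
model.fit(X, y)
print("training 0-1 error:", np.mean(model.predict(X) != y))
```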
What I want to do is reduce the number of features to a fixed size k (in the plots below, I run the experiments over several values of k). I’m going to try five methods (rough code sketches follow the list):
- random – Uniformly randomly select k features from the feature space.
- corr – Select the k features that have the highest individual correlation with the dependent variable.
- cream – Select the k most frequently occurring features (the cream of the crop; they rise to the top).
- crust – Like “cream,” except skip over the most frequent features first (the skip count depends on d, the dimensionality of the feature space). These are the “crust” features, i.e., those that are too frequent or not frequent enough to carry much information.
- merge – Randomly additively combine features until there are just k features left.
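In code, the five reducers look roughly like this (a NumPy sketch written for this post, with X the binary feature matrix, y the labels, k the target size, and the crust skip count left as a parameter since I only described it loosely above):

```python
# Sketches of the five feature reducers. X is an (n_docs, d) binary
# matrix, y the 0/1 labels, k the target number of features.
import numpy as np

def random_select(X, k, rng):
    return X[:, rng.choice(X.shape[1], size=k, replace=False)]  # uniform random columns

def corr_select(X, y, k):
    # absolute correlation of each binary feature with the label
    corrs = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    corrs = np.nan_to_num(corrs)             # constant columns give NaN correlations
    return X[:, np.argsort(corrs)[::-1][:k]]

def cream_select(X, k):
    freq = X.sum(axis=0)                     # how often each feature occurs
    return X[:, np.argsort(freq)[::-1][:k]]  # the k most frequent

def crust_select(X, k, skip):
    # like cream, but skip the most frequent features first;
    # `skip` is a placeholder for the d-dependent count described above
    order = np.argsort(X.sum(axis=0))[::-1]
    return X[:, order[skip:skip + k]]

def merge_select(X, k, rng):
    # randomly additively combine pairs of columns until only k remain
    X = X.astype(float).copy()
    while X.shape[1] > k:
        i, j = rng.choice(X.shape[1], size=2, replace=False)
        X[:, i] += X[:, j]
        X = np.delete(X, j, axis=1)
    return X
```

Each reduced matrix then gets fed through the same logistic regression fit as before.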
Here’s the pretty dotchart (the red line is how well you’d do using random guessing) of the 0-1 error estimated using 10 bootstrap replications:
Predictive performance using different feature selection mechanisms.
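(For the record, the bootstrap estimate amounts to something like the sketch below: resample the rows with replacement, refit, and score 0-1 error. Scoring on the out-of-bag rows is one reasonable choice; the exact details may differ from what produced the chart.)

```python
# Sketch: estimate 0-1 error with 10 bootstrap replications by fitting
# on each bootstrap resample and scoring on the out-of-bag rows.
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_error(X, y, n_reps=10, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    errors = []
    for _ in range(n_reps):
        boot = rng.integers(0, n, size=n)       # rows sampled with replacement
        oob = np.setdiff1d(np.arange(n), boot)  # rows never sampled
        model = LogisticRegression(penalty="l2", C=10.0, max_iter=1000)
        model.fit(X[boot], y[boot])
        errors.append(np.mean(model.predict(X[oob]) != y[oob]))
    return np.mean(errors)
```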
A few things that surprised me:
- You don’t really need that many features: with corr, using 50 features doesn’t do substantially worse than using 200.
- Corr was just something I threw together on the fly; I really didn’t expect it to work so well and I haven’t actually used it in practice. I guess I should start.
- Cream does better than crust. Crust was motivated by the fact that when you put data into topic models like LDA, you get better-looking topics when you remove overly frequent words (since they tend to bubble up to the top of every topic). I guess good-looking topics don’t necessarily make for good predictions. Food for thought.
- Random does poorly as one might expect.
- Merge is just about as bad as random. I don’t know how to reconcile this with what I said at the outset, except to note that 1.) merge isn’t necessarily combining features the same way a hash would, 2.) I’m using different features, 3.) a different number of features, 4.) different amounts of data, 5.) different data sets, and 6.) a different classifier.
Ok, so the data didn’t show me what I was hoping and half expecting to see. Chalk it up to different strokes for different domains. But I did find something that, while obvious in retrospect, will probably be useful going forward.