# Monthly Archives: April 2009

## Ignoring outliers during boosting

I had a discussion the other day about using the weights returned by boosting to do outlier detection. That is to say, the examples with high weight are hard to fit and are therefore likely to be mislabeled/outliers. So why not find some way of automatically ignoring these instances?

I know that BBM already magically does something along these lines, but I was wondering if there were some simple fudges for AdaBoost which would also do this. So I decided to run a quick experiment to see what would work. To set up the data, I randomly drew 100 points from $\mathbb{R}^2$ and let these be the instances. The binary labels for the instances were $+1$ iff the point fell within any of $K$ circles; that is, $y_j = +1$ if and only if $|x_j - c_i|_2^2 < r_i^2$ for any $i=1\ldots K$. Just to confound the problem a little bit I then randomly flipped the labels on 10% of the points. The first plate of each of the plots below shows this input data set with the flipped (read: noisy/mislabeled) instances in red and the $K=2$ circles in blue.

For the weak hypotheses I chose the set of circles (of arbitrary radius) which predict either +1 or -1 in their interior and abstain (i.e., predict 0) on their exterior. Since the true labels can be represented as a linear combination of weak hypotheses in principle it should be easy for boosting to recover drive the training down to 0 were it not for the noisy points. Typically AdaBoost will spend a lot of effort trying to fix up these points with little benefit with respect to generalization error.

Below I describe a couple of methods for finding and ignoring these points during the boosting process. For each I include a plot of the set of weak hypotheses after each iteration (blue circles), along with the predictions made by the combined hypothesis on each of the training instances (purple = +, green = -). Below each plot I also indicate the error on the entire training set (first number) along with the error on the non-noisy, correctly labeled points (second number) since we presumably don’t want to match the noisy points too closely.

• Stock AdaBoost – Nothing special here. This is just plain-jane AdaBoost. It does a pretty good job of driving down the training error (0.11 after 11 iterations). But the hypothesis found by AdaBoost is a rather complex combination of many circles that does not bear that much resemblance to the generating hypothesis. Also note that when the noise is removed, the error at the end is 0.07. Not so good when considering that this number could be zero after two or three iterations.
• AdaBoost /w thresholding – Here I again run stock AdaBoost except that the distribution shown to the weak learner ignores all points whose weight exceeds some threshold. Precisely, let $D$ denote the distribution over training examples which AdaBoost maintains internally. Then the distribution the weak learner sees is $D^*_i \propto \mathbf{1}(D_i \le \theta)D_i$. Here, the combined hypothesis is much simpler consisting essentially of just four circles. The predictions made by this technique really aren’t any worse than regular AdaBoost (0.12 vs 0.11 on the full training set, 0.07 in both cases on the de-noised set). So it seems like doing this is not a completely terrible idea; you make about the same predictions but with much simpler rules.
• AdaBoost /w trimming – Here, once again, we run AdaBoost except this time the distribution shown to the weak learner excludes the 10% of points with the highest mass. That is, let $R_i = \sum_{j=1}^M \mathbf{1}(D_i > D_j)$ be the rank of a training example, where $M$ denotes the total number of training examples. Then the distribution shown to the weak learner is $D^*_i \propto \mathbf{1}(R_i \le \theta)D_i$ (with some fudging I won’t get into here to break ties). The rules found by this method are unfortunately not that uncomplicated. And while some of the circles are vaguely in the spot they should be, in large part they’re all over the place. But what’s interesting here is that while the error on the training set is about the same as with the other techniques (0.12), the error on just the non-noisy ones is (0.03). So in fact, this method does the best job, in some sense, of ignoring the points which are noisy.

Practically speaking, I think I’d go with AdaBoost /w trimming as an easy approach to getting rid of label noise. Has anyone out there tried this before?

Filed under Uncategorized

## Because acyclic graphs are boring…

… here’s a link to John Langford’s fantastic blog.

Filed under Uncategorized

## Posterior inference for recessionistas

One of the things people learn during downturns is that it’s actually possible to look ab fab without spending a whole lot of money. Let’s try to apply the same enterprising spirit to posterior inference on mixture models.

Today we’ll consider a simple mixture of two Gaussians, with means set at $\mu = \langle +1, -1 \rangle$ and variance $\sigma^2 = 1$. Let’s also set the mixture components to be $\langle \theta, 1 - \theta \rangle, \theta=0.7$. This gives the following as the density of our distribution: $\theta \mathcal{N}(+1, 1) + (1 - \theta) \mathcal{N}(-1, 1)$. I’ve conveniently plotted this distribution for you here.

Now most problems in life come down to performing posterior inference, i.e., determining the distribution over the aforementioned parameters given some observations $x$ drawn from the density. In this case, let’s focus on computing the posterior density of $\theta$: $p(\theta | x, \mu, \sigma)$. We can apply Bayes’ rule to rewrite this as $p(\theta | x, \mu, \sigma) \propto p(x | \theta, \mu, \sigma)$, which is not too hard to compute. The problem is hidden in that proportionality — we don’t know what the proportionality constant should be.

So how do we get around this? Many people like MCMC, but around these parts, we have a lot of folks who use approximate variational inference. The idea behind approximate variational inference is to find a distribution which is “close” to the true posterior. Now I don’t want to get bogged down by details (search Google for “jordan 1999 variational inference” if you really want to know them), but one way of doing this approximates the desired posterior distribution by a Dirichlet distribution, $p(\theta | x, \mu, \sigma) \approx q(\theta | \gamma)$, where $\gamma$ are some Dirichlet parameters you learn as part of the variational inference procedure.

Variational inference doesn’t really guarantee all that much about the approximation you find. The usual folk wisdom is that variational inference does an ok job of finding some of the marginal modes, but underestimates the variance (to be fair, you could say the same thing about MCMC). Now I want to see how good my inference procedure turned out to be. How might I go about that? Well, suppose I have a reference point, say $\theta^*$. I can try to guesstimate that unknown proportionality constant $Z = p(x | \theta, \mu, \sigma) / p(\theta | x, \mu, \sigma) \approx p(x | \theta^*, \mu, \sigma) / q(\theta^*)$. Note that for this estimate of $Z$, there’s no guarantee that $p(x | \theta, \mu, \sigma) / Z$ integrates to one, but if $q$ happens to be equal to the true posterior, then $p(x | \theta, \mu, \sigma) / Z$ should exactly line up with $q(\theta)$. For convenience, I’m going to call this approximation to the posterior the scaled true posterior.

The big plot of all the approximations

I’ve plotted what this approximation looks like for different numbers of draws from the density (i.e., the posterior given $N$ observations). The variational approximation is given in black, and the scaled true posterior is given in purple. The green horizontal line shows the actual ratio of draws from the first mixture component for the given observations. The blue line shows the mode of the scaled true posterior (and therefore also the true posterior), which can be found with some straightforward numerical optimization. The reference point $\theta^*$ I use to estimate the proportionality parameter is the mode of the variational approximation.

Two quick observations. One, the modes (of the black and purple lines) tend to line up pretty well, except when the number of observations is really small. Two, the scaled true posterior tends to be a little bit fatter than the variational approximation which means that the variational approximation is indeed underestimating the variance. Now when faced with the discrepancy between the black and purple lines, a question lingers in my mind: how much of the difference is because of the fitting procedure (i.e., the objective we optimize and how we optimize it), and how much is due to the fact that
we’re choosing to represent the posterior with a Dirichlet distribution rather than some other distribution?

Well, let’s answer this question by trying to fit the Dirichlet distribution some other way instead. Our Dirichlet distribution has two free parameters, so we just need to find two constraints to fit them. Here’s what I decided to go with:

• Choose two points, say $\theta^{(1)}, \theta^{(2)}$. Then we can compute the ratio of the posterior probabilities for these two points exactly since the normalization cancels out: $R = p(\theta^{(1)} | x) / p(\theta^{(2)} | x) = p(x | \theta^{(1)}) / p(x | \theta^{(2)})$. It seems reasonable that the Dirichlet distribution should preserve the ratio between these two distinguished points, so we’ll make the first constraint $q(\theta^{(1)}) / q(\theta^{(2)}) = R$.
• It also seems reasonable to ensure that the mode of the approximation has the same mode as the true posterior: $\mathrm{arg}\max_\theta q(\theta) = \mathrm{arg}\max_\theta p(\theta | x)$.

It turns out that these constraints make for a simple system of linear equations. Naturally, one can also add more ratio constraints in the more general case. I found it helpful to choose $\theta^{(1)}$ to be the mode of the true posterior and $\theta^{(2)} = 0.95 \theta^{(1)}$. I performed this “inference” procedure; the result is plotted as the cyan line in the above plots. Once again, we can use this approximation to create a scaled true posterior; this scaled true posterior is plotted as the red line.

As you can see, the red line and cyan lines line up almost perfectly. Thus we’ve got a fantastic estimate of the posterior distribution. And all we had to do was to 1.) find the true mode, 2.) solve a system of linear equations. There are a lot of good tools to do this, so implementing this is no sweat. And because you can rely on really fast, efficient methods that people have put a lot of work into, it turns out that this is pretty damn fast. So maybe you don’t need all the fancy machinery of variational inference or MCMC after all; “making do” never looked so fabulous.