# Monthly Archives: November 2008

## Predictions Using Mixture Components

Suppose that I have the following model: $z_i \sim \mathrm{B}(\theta) \;\mathrm{ for }\; i=1\ldots N,$ $\lambda \sim \mathrm{Unif}(1\ldots N),$ and $p(y = 1 | \vec{z}, \lambda) = \pi_1 z_\lambda + \pi_0 (1 - z_\lambda).$ $z, \lambda$ are hidden and $y$ is observed.  This model is perhaps not as facile as it seems; the $z$‘s might be mixture components in a mixture model, i.e., you can imagine hanging other observations off of these variables.

Since $z, \lambda$ are hidden, we usually resort to some sort of approximate inference mechanism.  For the sake of argument, assume that we’ve managed to a.) estimate the parameters of the model ($\theta, \pi_1, \pi_0$) on some training set and b.) (approximately) compute the posteriors $p(\lambda)$ and $p(z_i)$ for some test document using the other parts of the model which I’m purposefully glossing over.   The question is, how do we make a prediction about $y$ on this test document using this information?

Seems like an easy enough question; and for exposition’s sake I’m going to make this even easier.  Assume that $p(\lambda) = \frac{1}{N}$.   Now, to do the prediction we marginalize out the hidden variables $p(y) = \sum_{\vec{z}} \sum_\lambda p(y | \vec{z}, \lambda) p(\vec{z}) p(\lambda) = \mathbb{E}_{p(\vec{z})}[\sum_\lambda p (y | \vec{z}, \lambda) \frac{1}{N}].$  By linearity, we can rewrite this as $\pi_1 \mathbb{E}[\bar{z}] + \pi_0 (1 - \mathbb{E}[\bar{z}]),$ where $\bar{z} = \frac{1}{N} \sum_i z_i.$  And if, as usual, I wanted the log probabilities, those would be easy too, $\log (\pi_1\mathbb{E}[\bar{z}] + \pi_0 ( 1 - \mathbb{E}[\bar{z}])).$

But suppose I were really nutter-butters and I decided to compute the log predictive likelihood another way, say by computing $\mathbb{E}_\lambda[\mathbb{E}_{\vec{z}}[\log p(y | \vec{z}, \lambda)]]$ instead of $\log \mathbb{E}_{\vec{z}}[\mathbb{E}_\lambda [p(y | \vec{z}, \lambda)]].$  It is worth pointing out that Jensen’s inequality tells us that the first term is a bound on the latter term.   Because the $\vec{z}$ is a set of indicators, the log probability has a convenient form, namely $\log p(y = 1 | \vec{z}, \lambda) = \log (\pi_1) z_\lambda + \log(\pi_0) (1 - z_\lambda).$  Applying the expectations yields $\log (\pi_1) \mathbb{E}[\bar{z}] + \log(\pi_0) ( 1-\mathbb{E}[\bar{z}]).$

Jensen’s inequality is basically a statement about a first order Taylor approximation, i.e., $\mathbb{E}[f(X)] \approx f(\mathbb{E}[X]),$ when $f$ is convex.    When is this approximation good?  We might attempt this by looking at the second order term.  I will save you the arduous derivation: $\log \mathbb{E}[p(y = 1 | \vec{z}, \lambda)] - \mathbb{E}[\log p(y = 1 | \vec{z}, \lambda)] \approx \frac{\pi_1 - \pi_0}{2 \pi_1 \mathbb{E}[\bar{z}] + 2 \pi_0 (1 - \mathbb{E}[\bar{z}])} \mathbb{E}[\bar{z}] (1 - \mathbb{E}[\bar{z}]).$

Now what better accompaniment to an equation than a plot.  I’ve plotted the 2nd order approximation to the difference (solid lines) and the true difference (dashed lines) for different values of $\pi_0$ while keeping $\pi_1$ fixed at $\sigma(4)$ where $\sigma$ is the sigmoid function.   The errors are plotted as functions of $\mathbb{E}[\bar{z}].$  I’ve also put a purple line at $\log(0.5);$ that’s the actual log probability you’d see if your predictive model just guessed $y=1$ with probability 0.5.

First thing off the bat: the error is pretty large.  For fairly reasonable settings of the parameters, the difference between the two predictive methods actually exceeds the error one would get just by guessing 50/50.  For small values of $\mathbb{E}[\bar{z}],$ the 2nd order approximation is pretty good, but for large values it greatly underestimates the error.  Should I be worried (because, for example, moving the log around the expectation is a key step in EM, variational methods, etc.)?

I was motivated to do all this because of a previous post, in which I claimed that a certain approximation of a similar nature was good; the reason there was the low covariance or the random variable.  Here, the covariance is large when the expectation is not close to either 0 or 1.  Thus the second (and higher) order terms cannot be ignored.  There are really two questions I want answered now:

1. What happens if the random variable in this model is $\bar{z}$ rather than $z$?
2. What happens if the random variable in the earlier post is $z$ instead of $\bar{z}$?

Filed under Uncategorized

## {Hoeffd|Is}ing

So in reading group today Chernoff bounds came up.  The bound basically says that if we sample random variables and sum them we are very likely to be close to the expected sum.  Estimating partition functions is a central challenge in many machine learning applications, so couldn’t we just sample and estimate the sum and have the bound guarantee that we’re not far off?

To wit, let $Y_1, Y_2, \ldots, Y_n$ be independent random variables bounded between 0 and $B$.  Let $S = \sum Y_i$.   Then $P(S - \mathbb{E}[S] \ge n\epsilon) \le \exp(-\frac{2n\epsilon^2}{B}).$

Now let us adapt this to estimating the partition function, $Z.$  Let $Y_i \equiv \mu(X_i),$ where $\mu$ is some measure bounded above by $B.$ Note that $Y_i$ is a random variable that satisfies the aforementioned conditions.  Also note that $Z = \sum \mu(X_i) = \sum Y_i$ and that $\mathbb{E}[\mu(X_i)] = \frac{Z}{|\mathscr{X}|}.$  This means that we can write the condition in the inequality as $\frac{\hat{Z} - Z}{|\mathscr{X}|} \ge \epsilon,$ where $\hat{Z}$ is the empirically estimated expected value of $\mu$.  This seems pretty good doesn’t it?

So let’s see what happens when we do things empirically.  I randomly generated Ising models by sampling canonical parameters (using the standard overspecified representation) from a N(0, 1) distribution for both individual and pair parameters.  In total, there were 40 nodes.  I estimate the partition function (or rather, the partition function divided by 2^40) using the above technique and plot the estimate as a function of iteration over 300 iterations.  I restarted the estimation 5 times to see if the different runs would converge to similar values.   Attached are said plots for 2 different models (10 runs of 300 iterations total).

Monte carlo estimates of an Ising model partition function (1/2)

Monte carlo estimates of an Ising model partition function (2/2)

It seems like this estimate does horrendously.  You might as well use a random number generator =).  The key problem is that there are sudden jumps in our estimates.  When there’s enough coupling in the model, the modes become very peaky.   Another way of looking at it is by examining the bound on the right.  Recall that this term depends on the bound on the measure, $B$.  Unfortunately, this bound can be quite large; for the parameters I’ve chosen, it is something in the neighborhood of $\exp(\frac{40 * 41}{2 * 2} | N(0, 1) | )$ and expands exponentially in the number of nodes in the graph.

So there we have it, Hoeffding’s inequality can only be applied when there isn’t much coupling and the bound on the measure is small.  As the bound increases, Hoeffding’s inequality gives us exponentially worse guarantees; unfortunately, the strongly coupled case is precisely the interesting case!

1 Comment

Filed under Uncategorized

## Another Wrinkle…

So one more wrinkle to add to the pile.  I was wondering why a “better” method should fare worse.  All the explorations below have only confirmed that.  Well, there’s one other way (other than the E-step and the M-step) in which the two differ: prediction.  The link predictions routines are different and link prediction is where the real performance differences are as well.   In the case of $\psi_e$, we treat the model as a mixed-membership model.  This means that no additional approximations are needed to compute the quantity of interest $\mathbb{E}[p(y_{ij} | z_i, z_j)]$.  That is to say, rather than calculate $\mathbb{E}[\log p(y_{ij} | z_i, z_j)]$ as one does for the ELBO during inference and then exponentiating, instead we calculate the desired marginal directly (which is easy since we treat the covariates as indicators).

A different approach is used for $\psi_\sigma$.  There, we compute the expected log likelihood, as in the ELBO, and then exponentiate.  We can do this by applying a first-order approximation; this basically linearizes this term and allows us to move the expectation freely around.  How much is lost by this?

Instead of answering this question directly, I ask another question; how much is gained by doing the right thing on $\psi_e$.  I rewrote that computation to better mirror what we were doing in the $\psi_\sigma$ case.   Answer: $e > \sigma > e'$ where $e'$ is my short-hand for $\psi_e$ with the incorrect prediction scheme.   So we see that all of the gains that you get by going with $\psi_e$ evaporate when we change how prediction is done.  This indicates that maybe the real culprit is how we predict using $\psi_\sigma$

1 Comment

Filed under Uncategorized

In the previous post, I showed how to compute the approximate objective function for regularization.  I implemented such a harness in R with some real data from one of my runs to prototype it.  Directly using nlm seems to work in fits and starts, perhaps because the optimization routine does not use gradients.  The next step then is to compute the gradients.   The gradient of the zero’th order term is $\sigma(\mu) (\vec{p}^+ \circ \vec{p}^+),$ where $\mu = \eta^T \vec{p} \circ \vec{p} + \nu$ and $\vec{p}^+ = \langle \vec{p}, 1\rangle$.

The second order term is where it gets hard.  Taking the derivative by parts, we get $\frac{1}{2} A''(\mu) \nabla \mathrm{Var}(x) + \frac{1}{2}\mathrm{Var}(x)A'''(\mu)(\vec{p}^+ \circ \vec{p}^+)$. To compute $\nabla \mathrm{Var}(x)$, we use the derivation in the previous post to show that this is equal to $2 \eta \circ (V - \mathrm{Diag}(C)) + C \eta + \eta \circ \mathrm{Diag}(C)$.

I implement all this in R and it seems to converge erratically.  Further more, the “converged” values perform worse than the not fitting at all.   So to summarize results:

• Both $\psi_e$ and $\psi_\sigma$ perform best without any M-Step fitting.
• $\psi_\sigma > \psi_e$ when the first is unfit but the second is fit.
• $\psi_e > \psi_\sigma$ when both are fit or both are unfit.

This is all very peculiar.  In order to ameliorate the fitting process, I collapsed the parameter space so that $\vec{\eta} = \langle \eta, \ldots, \eta\rangle$ so that there are two parameters.   This seems to converge ok, and give results comparable to without fitting for $\psi_\sigma$.  When I do the similar parameter collapsing in $\psi_e$ it performs worse.

In summary, it seems like across the board not fitting is still the best.

Filed under Uncategorized

## Regularized Logistic Regression

Currently there are two regularization penalties, and this is sort of a hack.  Ideally, we’d want to stick with one that is consistent across methods.  This involves simulating additional non-links.   In other words, we want to add $\log p(y = 0 | \psi_1) + \log p(y = 0 | \psi_2) + \ldots + \log p(y = 0 | \psi_n)$ to the likelihood we wish to optimize, where $\psi$ is drawn from a distribution of our choosing.  Note that this becomes equivalent to $\mathbb{E}[\log p(y = 0 | \psi)] = -\mathbb{E}[A(\eta^T \psi + nu)]$, where $A(x) = \log(1 + \exp(x))$.   This is intractable to do exactly so we use the same old Taylor trick.

This is very important: first-order is NOT enough.  To see why this is, consider that what a first order approximation really says is that you can replace a set of points with their mean.  Well, if you did this in logistic regression for all the 0’s and all the 1’s you’d get a single point for the 0’s and a single point for the 1’s, i.e., complete separability.  And therefore your estimates will diverge.

So what are the ingredients to a second-order approximation?  We first need to compute the gradients of the partition function: $A'(x) = \sigma(x)$, $A''(x) = \sigma(x) \sigma(-x)$, $A'''(x) = \sigma(x) \sigma(-x) (1 - 2\sigma(x))$.  The other thing we need is the variance of $x = \eta^T \psi + \nu$.  Note that since variance is shift invariant, we can just compute the variance of $x = \eta^T \psi$.

We can expand this by $\mathrm{Var}(x) = \sum_i \sum_j \eta_i \eta_j \mathrm{Cov}(\psi_i, \psi_j)$.  Typically, covariance has two forms, one for when $i=j$ and when for when $i \ne j$.  For convenience, we will denote the first $V_i$ and the latter $C_{ij}$.  Then this can be rewritten as $\sum_i \eta_i^2 (V_i - C_{ii}) + \sum_i \sum_j \eta_i \eta_j C_{ij}$.

At this point we need to make about the distribution of $\psi$.  We will assume that $\psi = z_1 \circ z_2$ where $z \sim \mathrm{Dir}(\alpha \vec{p})$.  $C_{ij} = \mathbb{E}[z_{i} z_{j}]^2- p_i^2p_j^2$.  $\mathbb{E}[z_i z_j ] = \mathrm{Cov}(z_i, z_j) + p_i p_j = -p_i p_j \frac{1}{\alpha + 1} + p_i p_j = \frac{\alpha}{\alpha + 1} p_i p_j.$  Consequently, $C_{ij} = -\frac{2\alpha + 1}{(\alpha+1)^2} (p_ip_j)^2$.

On to the next term: $V_i = \mathbb{E}[z_i^2]^2 - p_{i}^4$.  Using common properties of the Dirichlet, $\mathbb{E}[z_{i}^2] = \mathrm{Var}(z_i) + p_{i}^2 = \frac{p_i (1 - p_i)}{\alpha + 1} + p_{i}^2 = \frac{p_i (1 + \alpha p_i)}{\alpha + 1}$. This yields $V_i = p_i^2 \frac{1 + 2 \alpha p_i }{(\alpha + 1)^2} + C_{ii}$.

Finally, notice that $\sum_i \sum_j \eta_i \eta_j C_{ij} = -\frac{2\alpha + 1}{(\alpha + 1)^2} (\eta^T (\vec{p} \circ \vec{p}))^2$.

Filed under Uncategorized

## What’s the deal with the logistic response?

Below I asked the question of why an approximation which is more accurate seems to be performing worse than one which is more haphazard.  The two methods differ in two principal ways:

• The approximate gradient computed during the E-step.
• The optimization of the parameters in the M-step.

In the E-step they differ mainly in how they push a node’s topic distribution toward its neighbors’.  With the simple but incorrect method, the gradient is proportional to $\eta^t z'$ while the correct method gives $\eta^t z'(1 - \sigma(\eta^t z \circ z' + \nu))$.  Intuitively, the last method tells us that the closer we are to the right answer, the less we need to push.

I futzed with the code to make it so that we push more like the incorrect method, but this didn’t seem to affect the results.  I suspect that $\eta^t z \circ z' + \nu$ is always pretty small so this doesn’t have much of an impact.  Then I tried changing the M-Step.  In particular, I tried removing the M-Step fitting with $\psi_\sigma$.  It turns out that this “fit” performs about as well as $\psi_e$.  Examining the fits shows that $eta$ under $\psi_\sigma$ is consistently smaller than for $\psi_e$.  Why is this?  I hypothesize that the L2 regularization penalty is causing this to fail.  Having to determine the optimal penalty is a pain; I originally added this term because the fits were diverging.  Perhaps the right thing to do is figure out why they were diverging in the first place.

Filed under Uncategorized

## Reproducing experiments and rediscovering alpha

It is a good idea to always make sure to make experiments exactly reproducible.  Sometimes in the heat of a paper deadline, a bunch of experiments are run with different parameter settings and what not and a few months down the road one might be hard pressed to remember exactly how you got those numbers.  I’ve learned this the hard way so for the last few submissions I’ve always tried to make sure something in the CVS encodes the precise settings necessary.

Last night, while looking up what I did, I rediscovered that alpha is important sometimes.  Not so much because of scale, but because of the distribution over topics it reports.  My experience is that fitting alpha in an EM context rarely works; it either totally messes up your inference or it just reproduces the alpha that you fed in initially.  But I did get into the habit of having the model fit and spit out alpha after the very last iteration of EM, just get some idea of what the topic distributions look like.

Usually you get something which is pretty uniform over your K topics.  But sometimes you don’t.  In old-fashioned LDA, this usually happens when you have too many topics.  With models such as RTM, you can also get this effect because your graph manifold may have cliques of different sizes.  In these cases, when you’re computing predictive word likelihood of an empty document, you are liable to perform much worse than baseline if you assume a symmetric Dirichlet prior rather than one which reflects the topic distribution on your training set.