Thursday, September 14, 2017

Learning vs Inference

In machine learning, the terms "learning" and "inference" are used often and it's not always clear what is meant. For example, is "variational inference" and neural network "inferencing" the same thing? Usually not!

When the deep learning crowd says "inference", what they mean is "perform a prediction." When the bayesian crowd says "inference", what they really mean is "compute the posterior distribution of a latent variable model"; e.g., for a model p(x, z) = p(x | z)p(z) and data x, compute p(z | x).

When someone says learning, they mean their model has a set of parameters and they need to find a good setting of those parameters; e.g. their model is actually p(x, z) = p(x | z; theta)p(z) and they want to find theta. For frequentists, this means finding a point estimate of the parameters. For a bayesian, learning is inference because the parameters are treated as latent, random variables, so they can use normal variational inference or MCMC to compute a distribution over the parameters. Note that for 99% of deep learning practitioners, training is synonymous with learning as the practitioner will likely only be computing a single set of values (aka point estimate) for their weights.

Sunday, September 10, 2017

Why is the ELBO difficult to optimize?

The task of bayesian inference is to compute the posterior p(z | x) of the model p(x, z) = p(x|z)p(z) where z is a latent variable. This is often intractable, so a bayesian may resort to approximating it with some easier distribution q(z) — this method is called variational inference . Usually the approximating distribution is chosen from a parameterized family of distributions. Thus q(z) is really the variational distribution q(z; lambda) for some chosen value lambda. How is this parameter lambda chosen?

A common method is to minimize the KL-divergence between the variational distribution and the true posterior, KL[q(z; lambda) || p(z | x)]. Of course, the posterior isn’t actually known, but if we break down this KL divergence, we see the following:

KL[q(z) || p(z | x)] = E_q(z)[log(q(z)) - log(p(z|x))]
= E_q(z)[log(q(z)) - log(p(x, z)/p(x))]
= E_q(z)[log(q(z)) - log(p(x, z)) + log(p(x))]
= E_q(z)[log(q(z)) - log(p(x, z))] + log(p(x))
= E_q(z)[log(q(z)) - log(p(x|z)) - log(p(z))] + log(p(x))

The KL divergence can be minimized up to a constant (the marginal likelihood) wrt the variational parameters since we know how to compute p(x, z) per the model’s definition. This term that we want to minimize — E_q(z)[log(q(z)) - log(p(x|z)) - log(p(z))] — is the negative evidence lower bound (ELBO). So, back to the original title question — why is the ELBO difficult (but not impossible) to optimize?

It actually isn’t difficult to optimize for certain classes of models. If all the latent variables are chosen to be independent in the variational distribution (i.e. the mean-field assumption) and the model is conditionally conjugate, then one can use coordinate ascent variational inference.  This method is fast to compute. Unfortunately, the restrictions it places on the model are also quite limiting. It is when the model is unconstrained is optimizing the ELBO difficult. But again, why?

If nothing is known about the model p(x|y)p(y) except a method to evaluate it, we are relegated to optimizing the ELBO by gradient descent. This entails computing the gradient of the ELBO (excuse my abuse of notation):

Gradient of E_q(z)[log(q(z)) - log(p(x|z)) - log(p(z))]
= gradient of (integral of q(z)*(log(q(z)) - log(p(x|z)) - log(p(z)))dz)
= gradient of (integral of q(z)*log(q(z))*dz) - (integral of q(z)*(log(p(x|z)) - log(p(z))*dz))
= integral of (gradient of q(z)*log(q(z))*dz)) - (integral of (gradient of q(z))*(log(p(x|z)) - log(p(z))*dz))

where the I've shortened q(z; lambda) to q(z) and the gradient is wrt lambda.

When I first saw this equation, I wasn’t sure what the big deal was. This is because I wasn’t accustom to staring at math equations all day, but to any (applied) mathematician it should be obvious. I’ll put in bold font: computing the gradient of the ELBO *requires computing some likely high-dimensional, likely intractable integral.* Why is it intractable? Well, you may get lucky and the integral has an analytic solution, but in general that won’t be true. Also, because  this quantity no longer takes the form of an expectation, it can’t easily be estimated by Monte Carlo. It may be possible to do so if z is discrete and its sample space is small (as you could exhaustively evaluate the integrand at all z), but that implies z is very low dimensional which may not be true either.

Luckily there exist some methods to side-step these issues by placing mild(ish)  constraints on either the variational family of distributions or both the variational family and the model. These methods are the score function gradient and reparameterization gradient, but I won’t be discussing them in this post — they’re best left to be explained another day (or by Google)..

Thursday, September 7, 2017

Expectation Maximization vs Variational Bayes

I constantly find myself forgetting the details of the EM algorithm, variational bayes, and what exactly the difference is between the two. To avoid confusion in the future, I wrote the following note.

Q: What is the difference between EM and Variational Bayes (aka "Variational Bayesian EM")?
A: Both are iterative methods for learning about the parameters theta and the posterior p(z | x, theta) in a latent variable model p(x, z | theta) by trying to a maximize a lower bound to the marginal likelihood p(x) (or p(x | theta) when using EM). The difference between the two techniques lies in following:

  1. VB assumes that, even when given a setting of the parameters theta, it is not tractable to compute the posterior p(z | x, theta). Thus, it must be approximated by some variational distribution q(z). On the other hand, EM makes the opposite assumption - given a setting of the parameters theta, it will compute the exact posterior p(z | x, theta). In this sense, VB is a relaxation of EM.
  2. VB is a bayesian technique while EM is a frequentist technique. As such, VB treats theta as a latent, random variable and computes a distribution q(theta) that approximates the posterior distribution p(theta | x). EM, a maximum likelihood technique, computes a point estimate to the parameters theta instead of a full distribution.
With these differences in mind, we can describe the two algorithms using a similar framework.

EM’s objective is to maximize a lower bound to p(x | theta),  E_q(z)[log(p(x, z | theta))].
VB’s objective is to maximize a lower bound to p(x), E_q(z, theta)[log(p(x, z, theta))].

EM first maximizes the objective by fixing theta and then finding an optimal setting of q(z). This optimal setting is always calculated to be the posterior p(z | x, theta).
VM first maximizes the objective by fixing q(theta) and finding a better setting of q(z)  (often by some gradient method).

EM maximizes the objective by fixing q(z) to the posterior from the previous E-step, and then calculates a maximum likelihood point estimate of the parameters theta.
VM maximizes the objective by fixing q(z) and then finding a better setting of q(theta) (again, often by some gradient method).

The E and M step are then repeated until convergence (i.e. until the point at which the log marginal likelihood no longer improves).

One last point: there is another algorithm called "Variational EM". This algorithm is just the E-step of VB combined with the M-step of EM. Since parameters are not treated as random variables, it is not a bayesian technique.