## Thursday, September 7, 2017

### Expectation Maximization vs Variational Bayes

I constantly find myself forgetting the details of the EM algorithm, variational bayes, and what exactly the difference is between the two. To avoid confusion in the future, I wrote the following note.

Q: What is the difference between EM and Variational Bayes (aka "Variational Bayesian EM")?
A: Both are iterative methods for learning about the parameters theta and the posterior p(z | x, theta) in a latent variable model p(x, z | theta) by trying to a maximize a lower bound to the marginal likelihood p(x) (or p(x | theta) when using EM). The difference between the two techniques lies in following:

1. VB assumes that, even when given a setting of the parameters theta, it is not tractable to compute the posterior p(z | x, theta). Thus, it must be approximated by some variational distribution q(z). On the other hand, EM makes the opposite assumption - given a setting of the parameters theta, it will compute the exact posterior p(z | x, theta). In this sense, VB is a relaxation of EM.
2. VB is a bayesian technique while EM is a frequentist technique. As such, VB treats theta as a latent, random variable and computes a distribution q(theta) that approximates the posterior distribution p(theta | x). EM, a maximum likelihood technique, computes a point estimate to the parameters theta instead of a full distribution.
With these differences in mind, we can describe the two algorithms using a similar framework.

Objective:
EM’s objective is to maximize a lower bound to p(x | theta),  E_q(z)[log(p(x, z | theta))].
VB’s objective is to maximize a lower bound to p(x), E_q(z, theta)[log(p(x, z, theta))].

E-step:
EM first maximizes the objective by fixing theta and then finding an optimal setting of q(z). This optimal setting is always calculated to be the posterior p(z | x, theta).
VM first maximizes the objective by fixing q(theta) and finding a better setting of q(z)  (often by some gradient method).

M-step:
EM maximizes the objective by fixing q(z) to the posterior from the previous E-step, and then calculates a maximum likelihood point estimate of the parameters theta.
VM maximizes the objective by fixing q(z) and then finding a better setting of q(theta) (again, often by some gradient method).

The E and M step are then repeated until convergence (i.e. until the point at which the log marginal likelihood no longer improves).

One last point: there is another algorithm called "Variational EM". This algorithm is just the E-step of VB combined with the M-step of EM. Since parameters are not treated as random variables, it is not a bayesian technique.