The EM Algorithm

The expectation-maximization (EM) algorithm is an iterative approach to finding a (local) maximum of the likelihood; it can be computationally much more tractable than maximizing the likelihood directly.

It is particularly useful in problems that have missing data or latent (unobserved) variables.

It belongs to a general class of optimization algorithms that alternate between two kinds of operations: constructing or estimating something, and maximizing or minimizing something.

We observe data \((X, Y) \sim {\mathbb P}_\theta\), where \(\{{\mathbb P}_\theta : \theta \in \Theta\}\) is a known parametric family of probability distributions, but the particular value of \(\theta\) is unknown.

We want to estimate \(\theta\) using maximum likelihood, that is, to find

\[\begin{equation*} \hat{\theta} \in \arg \max_{\eta \in \Theta} {\mathbb P}_\eta (X,Y). \end{equation*}\]

It’s common to work with the log likelihood; since the logarithm is a strictly increasing function, the value of \(\eta\) that maximizes the log likelihood is the same as the value that maximizes the likelihood:

\[\begin{equation*} \hat{\theta} \in \arg \max_{\eta \in \Theta} \ln {\mathbb P}_\eta (X,Y) = \arg \max_{\eta \in \Theta} L(\eta). \end{equation*}\]
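
If both \(X\) and \(Y\) were observed, this maximization would often be routine. In the problems EM addresses, however, only \(X = x\) is observed; \(Y\) is missing or latent, so the quantity we can actually evaluate is the marginal (observed-data) log likelihood \(\ln {\mathbb P}_\eta(x) = \ln \sum_y {\mathbb P}_\eta(x, y)\) (an integral for continuous \(Y\)), and the sum inside the logarithm generally makes direct maximization awkward. As a concrete illustration (a standard example, not specific to these notes): for \(n\) independent observations from a two-component Gaussian mixture with unit variances, mixing weight \(\pi\), means \(\mu_1, \mu_2\), unobserved component labels \(Y_j \in \{1, 2\}\), and \(\eta = (\pi, \mu_1, \mu_2)\),

\[\begin{equation*} \ln {\mathbb P}_\eta(x) = \sum_{j=1}^n \ln \left( \pi \, \phi(x_j - \mu_1) + (1 - \pi) \, \phi(x_j - \mu_2) \right), \end{equation*}\]

where \(\phi\) is the standard normal density. The parameters are coupled inside the logarithm, so there is no closed-form maximizer; EM instead works with the complete-data log likelihood \(\ln {\mathbb P}_\eta(x, y)\), which, given the labels, separates into simple terms.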

The EM Algorithm

Initialize a starting guess \(\hat{\theta}_0\). Repeat the following two steps until convergence (which is not guaranteed) or boredom (a code sketch of both steps follows the list):

  • E-step: Find \(Q(\eta, \hat{\theta}_i) \equiv {\mathbb E}_{\hat{\theta}_i} \left[ \ln {\mathbb P}_{\eta} (X, Y) \mid X = x \right]\), the expected value of the complete-data log likelihood, taken over \(Y\) given the observed data \(X = x\), under the current estimate \(\hat{\theta}_i\)

  • M-step: Set \(\hat{\theta}_{i+1} \in \arg\max_{\eta \in \Theta} Q(\eta, \hat{\theta}_i)\)
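
For concreteness, here is a minimal sketch of the two steps for the two-component, unit-variance Gaussian mixture described above. It is illustrative only: the function name, convergence rule, and defaults are not from these notes. For that model the E-step amounts to computing the posterior probability that each observation came from component 1, and the maximizer of \(Q\) in the M-step is available in closed form.

```python
import numpy as np

def em_gaussian_mixture(x, theta0, n_iter=100, tol=1e-8):
    """EM for a two-component Gaussian mixture with unit variances.

    x      : 1-D array of observations
    theta0 : starting guess (pi, mu1, mu2)
    Returns the final estimate and the observed-data log likelihood
    at each iterate theta_hat_i.
    """
    pi, mu1, mu2 = theta0
    log_liks = []
    for _ in range(n_iter):
        # E-step: under the current estimate, the conditional distribution of
        # each label Y_j given X_j = x_j is Bernoulli; r[j] is the posterior
        # probability of component 1.  These "responsibilities" determine
        # Q(eta, theta_hat_i), the expected complete-data log likelihood.
        d1 = pi * np.exp(-0.5 * (x - mu1) ** 2) / np.sqrt(2 * np.pi)
        d2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2) / np.sqrt(2 * np.pi)
        r = d1 / (d1 + d2)

        # Observed-data log likelihood at the current estimate theta_hat_i.
        log_liks.append(np.log(d1 + d2).sum())

        # M-step: maximize Q(eta, theta_hat_i) over eta; for this model the
        # maximizers are responsibility-weighted averages.
        pi = r.mean()
        mu1 = (r * x).sum() / r.sum()
        mu2 = ((1 - r) * x).sum() / (1 - r).sum()

        if len(log_liks) > 1 and abs(log_liks[-1] - log_liks[-2]) < tol:
            break
    return (pi, mu1, mu2), log_liks
```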

It’s a theorem that \(\ln {\mathbb P}_{\hat{\theta}_{i+1}}(x) \ge \ln {\mathbb P}_{\hat{\theta}_i}(x)\), where \({\mathbb P}_\eta(x)\) denotes the marginal (observed-data) likelihood: the observed-data log likelihood never decreases from one iteration to the next.

However, the algorithm need not converge to a maximum of the likelihood. It does behave well for exponential families.
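
As a quick numerical check of the monotonicity claim, one can run the hypothetical em_gaussian_mixture sketch above on synthetic data (the parameter values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.random(500) < 0.3    # latent component labels (unobserved in practice)
x = np.where(labels, rng.normal(-2.0, 1.0, 500), rng.normal(1.0, 1.0, 500))

theta_hat, log_liks = em_gaussian_mixture(x, theta0=(0.5, -1.0, 0.5))

# The observed-data log likelihood never decreases across iterations,
# though the limit need not be a global maximum.
assert all(b >= a - 1e-9 for a, b in zip(log_liks, log_liks[1:]))
print(theta_hat, log_liks[-1])
```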

Why does it work?

Recall Jensen’s Inequality: if \(f\) is a convex function, then \(\mathbb{E} f(X) \ge f(\mathbb{E} X)\); if \(f\) is concave, as the logarithm is, the inequality reverses: \(\mathbb{E} f(X) \le f(\mathbb{E} X)\).

Because \(\ln\) is concave, Jensen’s inequality implies that for every \(\eta \in \Theta\), \(\ln {\mathbb P}_{\eta}(x) - \ln {\mathbb P}_{\hat{\theta}_i}(x) \ge Q(\eta, \hat{\theta}_i) - Q(\hat{\theta}_i, \hat{\theta}_i)\): increasing \(Q(\cdot, \hat{\theta}_i)\) cannot decrease the observed-data log likelihood.
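
In more detail, here is a sketch of the standard argument, treating \(Y\) as discrete and assuming \({\mathbb P}_{\hat{\theta}_i}(x, y) > 0\) whenever \({\mathbb P}_\eta(x, y) > 0\). Since \({\mathbb E}_{\hat{\theta}_i} \left[ {\mathbb P}_\eta(x, Y) / {\mathbb P}_{\hat{\theta}_i}(x, Y) \mid X = x \right] = {\mathbb P}_\eta(x) / {\mathbb P}_{\hat{\theta}_i}(x)\),

\[\begin{align*} \ln {\mathbb P}_\eta(x) - \ln {\mathbb P}_{\hat{\theta}_i}(x) &= \ln {\mathbb E}_{\hat{\theta}_i} \left[ \frac{{\mathbb P}_\eta(x, Y)}{{\mathbb P}_{\hat{\theta}_i}(x, Y)} \,\middle|\, X = x \right] \\ &\ge {\mathbb E}_{\hat{\theta}_i} \left[ \ln \frac{{\mathbb P}_\eta(x, Y)}{{\mathbb P}_{\hat{\theta}_i}(x, Y)} \,\middle|\, X = x \right] \\ &= Q(\eta, \hat{\theta}_i) - Q(\hat{\theta}_i, \hat{\theta}_i), \end{align*}\]

where the inequality is Jensen’s, applied to the concave function \(\ln\). Taking \(\eta = \hat{\theta}_{i+1}\), the M-step guarantees \(Q(\hat{\theta}_{i+1}, \hat{\theta}_i) \ge Q(\hat{\theta}_i, \hat{\theta}_i)\), so \(\ln {\mathbb P}_{\hat{\theta}_{i+1}}(x) \ge \ln {\mathbb P}_{\hat{\theta}_i}(x)\).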

Note that if, in fact, \(Y\) is deterministic, we are free to define the conditional distribution of \(Y\) given \(X = x\) however we wish (as long as we give positive probability to every possible value \(y\)): by Jensen’s inequality, the conditional expectation of the complete-data log likelihood with respect to that distribution still lower-bounds its value at the true \(y\).