This post explains Expectation-Maximization (EM) algorithm from scratch in a fairly concise fashion. The material is based on my own notes, which of course come from a variety of great resources online that are listed in the references section.
EM is one of the most elegant and widely used machine learning algorithms but is sometimes not thoroughly introduced in introductory machine learning courses. What is so elegant about EM is that, as we shall see, it originates from nothing but the most fundamental laws of probability.
Many variants of EM have been developed, and an important class of statistical machine learning methods called variational inference also has a strong connection to EM. The core ideas and derivatives of EM find many applications in both classical statistical machine learning and models that involve deep neural networks, making it worthwhile to have an intuitive and thorough understanding of it, which is what this post attempts to provide.
- Random variables , probability distribution
- Probability density function (PDF) , evaluated at value : with as a shorthand
- PDF with parameter is noted as or equivalently
- Expectation of according to distribution :
- A set is noted as or calligraphic letter
Supposed we had data coming from a distribution , and we want to come up with a model for parameterized by : or equivalent noted as to best approximate the real data distribution. Further assume all the data samples are independent and identically distributed (iid) with .
To find under a maximum likelihood scheme we do
Motivation for EM
We might encounter situations where, in addition to observed data , we have missing or hidden data . It might literally be data that is missing for some reason. Or, more interestingly, it might be due to our modeling choice. We might prefer to have a model with a set of meaningful but hidden variables that help explain the “causes” of . Good examples of this category would be Gaussian (or other kind of) mixture models, and LDA.
Note to myself: examples when we introduces latent variables just for the sake of making the optimization problem easier?
In either case, we will need to have a model for calculating the joint distribution of and , , which may arise from assumptions (in the case of missing data) or from models of marginal density functions and . In such cases, the log likelihood can be expressed as
Direct maximization of with respect to might be challenging, due to the summation over inside the log. But the problem would be much easier if we knew the values of . It is simply the original maximum likelihood problem with all data available.
The collection of is called the complete data. Naturally, is the incomplete data and is the latent data/variable.
Roughly speaking, EM algorithm is an iterative method that let us to guess based on (and current estimate of model parameter ). With the guessed “fill-in” we now have complete data and we optimize the log likelihood with respect to . We thus iteratively improve our guess of latent variable and parameter . We repeat this process until convergence.
In slightly more detail, instead of guessing a single value we guess the distribution of given , i.e. . then optimize the expected log likelihood for complete data, i.e. , with respect to which serves as a proxy (lower bound) for the true objective . Repeat until converge.
(Note in fact guessing a single value for is also a valid strategy. It corresponds to a variant of EM and is what we do in the well-known K-means algorithm, where we guess a “hard” label on each data points.)
The nice thing about EM is that it comes with theoretical guarantee of monotonic improvement on the true objective even through we directly work with a proxy (lower bound) of it. Note however the rate of convergence will depend on the problem and the convergence is not guaranteed to be towards the global optima.
As before, we start with the log likelihood
Here I switched the summation over to integral assuming is continuous, just to hint this is a possibility. The last step used Jensen’s inequality and the fact log function is strictly concave. So far we do not have any restrictions on the distribution , apart from being a probability density function and it is positive where is.
Using the result above, let’s define the last quantity as . It is usually called ELBO (Evidence Lower BOund) as it is a lower bound of .
Just to reiterate what we have done so far: our goal is to maximize , we exchanged the place of the log and integral over and got a lower bound .
We can show that the difference between and is
where we used the fact Kullback-Leibler (KL) divergence is defined as
In general, KL divergence is always nonnegative and is zero if and only if . So in our case, the equality holds if and only if . When this happens, we say the bound is tight. In this case, it makes sense to note as to make the dependence on clear.
EM algorithm and monotonicity guarantee
The EM algorithm is remarkably simple and it goes as follows.
- E-step (of -th iteration):
- Let , which is calculated as shown in Eq.
- Due to our particular choice of , at current estimate of the bond is tight:
- Maximize with respect to , see Eq.
- This step improves ELBO by finding a better :
The calculation in E-step is
Just to spell out the function that we maximize in M-step.
With the preparation earlier we can also easily show the theoretical guarantee on monotonic improvement over the optimization objective .
Why the “E” in E-step
By the way, the reason it is called E-step is because in that step we do the necessary calculation to figure out the form of as a function of which we then optimize in the M-step. The form of is the expectation of the log likelihood of complete data over the estimated distribution of the latent variable .
EM as maximization-maximization
Because the particular choice in E-step is to have diminishing , thus E-step can be viewed as maximizing with respect to and M-step as maximization with respect to . So we are doing alternating maximization on the EBLO with respect to and .
This maximization-maximization view offers justification for partial E-step (when the required calculation in exact E-step is intractable) and partial M-step (i.e. only find a that increases the ELBO rather than maximizes it). Under this view, the direct maximization on ELBO as a goal offers a strong connection to variational inference as will be discussed briefly below.
Example: Gaussian Mixture
In the context of Gaussian Mixture Model (GMM), associated with takes the value , where is the number of Gaussians in the model. Thus indicates which Gaussian cluster observed data point belongs to. The set of parameter includes those parameterize the marginal distribution of , . , with and . Also, include those parametrized the conditional distribution of .
Variants and extensions of EM
GEM and CEM
A popular variant to EM is that in Eq. we merely find a that increases (rather than maximizes) . It is easy to see and the monotonicity guarantee still holds in this situation. This algorithm is proposed in the original EM paper and called Generalized EM (GEM).
Another variant is the point-estimate version we mentioned earlier, where instead of having in the E-step, we take to be a single value - the most probable one, i.e. or equivalently taking . In this case, the integral in is greatly simplified, but the first equality in does not hold any more and we lose the theoretical guarantee. This algorithm is also called Classification EM (CEM).
As we can see in Eq. , we need to go through all data points in order to update , which could be long process for large data sets. In much of the same spirit as stochastic gradient descent, we could sample subsets of data and run the E- and M-step on these mini batches. The same idea can be used for variational inference mentioned below, on the update of global latent variables (such as ).
The computation of the optimal , i.e. in E-step might be intractable. Especially, the integral in the denominator of Eq. does not have closed form solution for many interesting models. In this case we can take the view of EM as maximization-maximization and try to come up with better and better to improve the ELBO. In order to proceed with such variational optimization tasks, we need to specify the functional family from which we will choose . Depending on the assumptions a number of interesting algorithms have been developed. The most popular one is probably mean-field approximation.
Note that in a typical variational inference framework, the parameter is treated as first class variables that we would do inference on (i.e. getting ) rather than taking a maximum likelihood single point estimation, so become part of the latent variables and absorbed into the notation . Thus, includes global variables such as and local variables such as the latent labels associated with each data point .
In mean-field method the constraint we put on is that it factorizes, i.e. . This is saying that all latent variables are mutual independent, by assumption. This seemingly simple assumption brings in remarkable simplifications in the calculation of integrals and especially the expectations of log likelihood involved. It leads to a coordinate ascent variational inference (CAVI) algorithm that allows closed-form iterative calculation for certain model family. The coordinate updates on local variables corresponds to the E-step in EM, while the updates on global variables corresponds to the M-step in EM.
For more about this topic see: D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational Inference: A Review for Statisticians,” J. Am. Stat. Assoc., vol. 112, no. 518, pp. 859–877, 2017.
Todo: add citation in text; for now just core dumped some references here
In no particular order:
A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Stat. Soc. Ser. B Methodol., vol. 39, no. 1, pp. 1–38, 1977.
R. M. Neal and G. E. Hinton, “A View of the Em Algorithm that Justifies Incremental, Sparse, and other Variants,” Learn. Graph. Model., pp. 355–368, 1998.
J. A. Bilmes, “A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models,” ReCALL, vol. 1198, no. 510, p. 126, 1998.
A. Roche, “EM algorithm and variants: an informal tutorial,” pp. 1–17, 2011.
M. R. Gupta, “Theory and Use of the EM Algorithm,” Found. Trends® Signal Process., vol. 4, no. 3, pp. 223–296, 2010.
M. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “Introduction to variational methods for graphical models,” Mach. Learn., vol. 37, no. 2, pp. 183–233, 1999.
M. J. Wainwright and M. Jordan, “Graphical Models, Exponential Families, and Variational Inference,” Found. Trends® Mach. Learn., vol. 1, no. 1–2, pp. 1–305, 2007.
M. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic Variational Inference,” vol. 14, pp. 1303–1347, 2012.
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational Inference: A Review for Statisticians,” J. Am. Stat. Assoc., vol. 112, no. 518, pp. 859–877, 2017.
S. Mohamed, “Variational Inference for Machine Learning,” no. February, 2015.
Z. Ghahramani, “Variational Methods The Expectation Maximization ( EM ) algorithm,” no. April, 2003.