Jekyll2021-01-21T00:30:45+00:00https://dingran.github.io/feed.xmlRan DingRan Ding's homepageRan DingExponential-Min and Gumbel-Max2019-01-01T00:00:00+00:002019-01-01T00:00:00+00:00https://dingran.github.io/Gumbel\[\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}\]
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<h2 id="introduction">Introduction</h2>
<p>I originally wanted to write down the proof for the Gumbel-max trick, but soon realized it is actually the same idea as a much more common problem: the <em>exponential race</em>. So, in this note, let’s start from this common problem and arrive at the Gumbel-max trick.</p>
<h2 id="competing-alarms">Competing Alarms</h2>
<p>As preparation, let’s first solve a probability problem.</p>
<hr />
<p>There are \(N\) clocks started simultaneously, such that the \(i\)-th clock rings after a random time \(T_i \sim \text{Exp}(\lambda_i)\)</p>
<ul>
<li>
<p>(1) Designate \(X\) as the random time after which some clock (i.e., any one of the clocks) rings; find the distribution of \(X\)</p>
</li>
<li>
<p>(2) Find the probability that the \(i\)-th clock rings first</p>
</li>
</ul>
<hr />
<p>Let \(X = \min \{T_1, T_2, \dots, T_N \}\) and let \(F_X(t)\) and \(F_{T_i}(t)\) be the corresponding CDFs. We also have \(F_{T_i}(t) = 1- e^{-\lambda_it}\).</p>
<p>Following order statistics of \(\min\), we have \(P(X>t) = \prod_{i=1}^N P(T_i>t)\) or equivalently,</p>
\[1 - F_X(t) = \prod_{i=1}^N (1-F_{T_i}(t)) = \prod_{i=1}^N e^{-\lambda_it}
= e^{-\sum_{i=1}^N \lambda_it}\]
<p>Therefore</p>
\[\begin{equation}
X \sim \text{Exp}(\lambda_X = \sum_{i=1}^N \lambda_i)
\label{part1}
\end{equation}\]
<p>i.e. the \(\min\) of a set of independent exponential random variables is itself an exponential random variable, with rate \(\lambda_X\) equal to the sum of the rates of those random variables.</p>
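This result is easy to check numerically. Below is a quick simulation sketch (the rates are arbitrary made-up values): if \(X \sim \text{Exp}(\lambda_X)\) with \(\lambda_X = \sum_i \lambda_i\), the sample mean of \(X\) should approach \(1/\lambda_X\).

```python
import random

random.seed(0)

# Hypothetical rates; any positive values work.
rates = [0.5, 1.0, 2.5]
lam_sum = sum(rates)  # the claimed rate of X = min(T_1, ..., T_N)

n_trials = 200_000
xs = [min(random.expovariate(lam) for lam in rates) for _ in range(n_trials)]

# If X ~ Exp(lam_sum), then E[X] = 1 / lam_sum = 0.25.
mean_x = sum(xs) / n_trials
```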
<p>For the second part of the problem, we can consider two competing alarms \(T_1\) and \(T_2\) to begin with. Our goal is to find \(P(T_1<T_2)\).</p>
\[\begin{split}
P(T_1 < T_2) & = \int_0^{+\infty} \int_{t_1}^{+\infty} P(T_1=t_1) P(T_2=t_2) dt_2 dt_1 \\\\
&= \int_0^{+\infty} P(T_1=t_1) \left(1-F_{T_2}(t_1)\right) dt_1 \\\\
&= \int_0^{+\infty} \lambda_1 e^{-\lambda_1 t_1} e^{-\lambda_2 t_1} dt_1 \\\\
& = \frac{\lambda_1}{\lambda_1+\lambda_2}
\end{split}\]
<p>Now, let’s consider one specific clock \(T_k\) versus all the rest, denoted \(T_{-k} = \min \{T_i\}_{i \neq k}\). According to \(\ref{part1}\), we know that \(T_{-k} \sim \text{Exp}(\sum_{i\neq k} \lambda_i)\). Using the result above, we have the solution to part (2) as follows</p>
\[\begin{equation}
P(T_k \text{ rings first}) = P(T_k<T_{-k}) = \frac{\lambda_k}{\lambda_k+\sum_{i\neq k}\lambda_i} = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}
\label{part2}
\end{equation}\]
<p>Of course, we can do the integration directly and get the same result</p>
\[\begin{split}
P(T_k<T_{-k}) & = \int_0^{+\infty} P(T_k=t_k) \left( \idotsint_{t_k}^{+\infty} \prod_{i\neq k}P(T_i=t_i) dt_i \right) dt_k \\\\
& = \int_0^{+\infty} P(T_k=t_k) \left( \prod_{i\neq k} \left(1-F_{T_i}(t_k)\right) \right) dt_k \\\\
& = \int_0^{+\infty} \lambda_k \exp{\left(-\lambda_k t_k\right)} \exp{\left(-\sum_{i \neq k}\lambda_i t_k\right)} dt_k \\\\
& = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}
\end{split}\]
<p>By the way, this setup, in which multiple exponential random variables compete and we look for the first arrival, is also called an <em>exponential race</em>.</p>
<h2 id="exponential-min-trick">Exponential-Min Trick<a name="argmin"></a></h2>
<p>I just made up the name “Exponential-Min”. The better name for this section is probably <em>Sampling from Multinomial by Argmining</em>.</p>
<p>Suppose we have a set of positive numbers \([\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]\). Correspondingly, we have a normalized probability vector \(\vec{p}=[p_1, p_2, p_3, \dots, p_N]\), where \(p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}\). This probability vector specifies a multinomial distribution over \(N\) choices.</p>
<p>Now, if we were to draw a sample from \(\{1, 2, \dots, N\}\) according to the multinomial distribution specified by \(\vec{p}\) (which is fundamentally specified by \([\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]\)), what should we do?</p>
<p>Normally, we do the following:</p>
<ol>
<li>We have \([\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]\)</li>
<li>We compute \(\vec{p}=[p_1, p_2, p_3, \dots, p_N]\), where \(p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}\).</li>
<li>We generate a uniform random number \(Q\) between 0 and 1, i.e. \(Q \sim \text{Uniform}(0,1)\)</li>
<li>We figure out where \(Q\) lands relative to the cumulative sums of \(\vec{p}\), i.e. if \(\sum_{j=1}^{i-1} p_j < Q \le \sum_{j=1}^{i} p_j\) we return \(i\). (Of course, the empty sum for \(i=1\) is taken to be 0.)</li>
</ol>
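The standard procedure above can be sketched as follows (a minimal version with made-up weights; note the code uses 0-indexed choices, and scales \(Q\) by the total weight instead of normalizing first):

```python
import random

def sample_index(lams):
    """Sample i in {0, ..., N-1} with probability lams[i] / sum(lams),
    by locating a uniform draw within the cumulative sums."""
    total = sum(lams)
    q = random.uniform(0, total)  # equivalent to Q ~ Uniform(0,1) scaled by total
    cum = 0.0
    for i, lam in enumerate(lams):
        cum += lam
        if q <= cum:
            return i
    return len(lams) - 1  # guard against floating-point round-off

random.seed(0)
lams = [1.0, 2.0, 3.0]
counts = [0, 0, 0]
for _ in range(60_000):
    counts[sample_index(lams)] += 1
# Empirical frequencies approach [1/6, 2/6, 3/6].
```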
<p>But that’s the boring way. Now that we have the Exponential-Min trick, we can instead do the following:</p>
<ol>
<li>We have \([\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]\)</li>
<li>We don’t compute \(\vec{p}\); instead we sample \(T_i \sim \text{Exp}(\lambda_i)\) for \(i=1, 2, \dots, N\), i.e. we have a total of \(N\) samples, one from each \(\text{Exp}(\lambda_i)\)</li>
<li>We now take \(\argmin([T_1, T_2, \dots, T_N])\) as our result sample</li>
<li>We proved in \(\ref{part2}\) that such a result sample indeed follows the multinomial distribution specified by \(\vec{p}=[p_1, p_2, p_3, \dots, p_N]\), where \(p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}\).</li>
</ol>
<p>Thus, somehow we ended up <em>Sampling from Multinomial by Argmining</em>!</p>
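The Exponential-Min recipe above can be sketched and verified in a few lines (same made-up weights as before): the argmin frequencies should match \(\lambda_k / \sum_i \lambda_i\).

```python
import random

random.seed(0)
lams = [1.0, 2.0, 3.0]  # unnormalized weights

def sample_by_argmin(lams):
    # Draw T_i ~ Exp(lam_i) and return the index of the smallest.
    ts = [random.expovariate(lam) for lam in lams]
    return min(range(len(ts)), key=ts.__getitem__)

n = 60_000
counts = [0] * len(lams)
for _ in range(n):
    counts[sample_by_argmin(lams)] += 1

freqs = [c / n for c in counts]
# freqs should approach [1/6, 2/6, 3/6] = lam_k / sum(lams)
```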
<h2 id="gumbel-distribution">Gumbel Distribution</h2>
<p>Now let’s move from the Exponential distribution to the Gumbel distribution.</p>
<p>The Gumbel distribution with unit scale (\(\beta=1\)) is parameterized by a location parameter \(\mu\). \(\text{Gumbel}(\mu)\) has the following CDF and PDF</p>
\[\text{CDF: } F(x; \mu)=e^{-e^{-(x-\mu)}}\]
\[\text{PDF: }f(x; \mu) = e^{-\left((x-\mu)+e^{-(x-\mu)}\right)}\]
<p>Consider a set of \(N\) independent Gumbel random variables \(G_i\), each with its own location parameter \(\mu_i\), i.e. \(G_i \sim \text{Gumbel}(\mu_i)\).</p>
<p>The Gumbel distribution has two properties that are quite analogous to the <em>exponential race</em> example above.</p>
<ul>
<li>(1) Let \(Z = \max \{G_i \}\), then \(Z \sim \text{Gumbel}\left(\mu_Z = \log \sum_{i=1}^N e^{\mu_i} \right)\)</li>
</ul>
<p>The proof is straightforward and similar to above:</p>
\[F_Z(x) = \prod_{i=1}^N F_{G_i}(x) = \prod_{i=1}^N e^{-e^{-(x-\mu_i)}} = e^{-\sum_{i=1}^N e^{-(x-\mu_i)}} = e^{-e^{-x} \sum_{i=1}^N e^{\mu_i}} = e^{-e^{-(x-\mu_Z)}}\]
<ul>
<li>(2) A corollary of the above is that the probability of \(G_k\) being the max is \(P(G_k > G_{-k}) = \frac{e^{\mu_k}}{\sum_{i=1}^N e^{\mu_i}}\), where \(G_{-k} = \max \{G_i\}_{i \neq k}\)</li>
</ul>
<h2 id="gumbel-max-trick">Gumbel-Max Trick</h2>
<p>Now here we can tell nearly an identical/parallel story as in the section <a href="#argmin">Exponential-Min Trick</a>. And, this section should really be called <em>Sampling from Multinomial by Argmaxing</em>.</p>
<p>The main differences are</p>
<ul>
<li>The numbers (parameters) \(\mu_i\) can be potentially negative, whereas \(\lambda_i\) must be positive</li>
<li>The probability vector is determined by \(p_k = \frac{e^{\mu_k}}{\sum_{i=1}^N e^{\mu_i}}\) instead of \(p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}\)</li>
<li>We generate samples with \(G_i \sim \text{Gumbel}(\mu_i)\) instead of \(T_i \sim \text{Exp}(\lambda_i)\)</li>
<li>We take \(\argmax\) over \(G_i\) instead of taking \(\argmin\) over \(T_i\)</li>
</ul>
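The parallel recipe can also be sketched in a few lines (the \(\mu_i\) below are made-up values; Gumbel samples are drawn by inverting the CDF, since \(\mu - \log(-\log U) \sim \text{Gumbel}(\mu)\) for \(U \sim \text{Uniform}(0,1)\)). The argmax frequencies should match the softmax probabilities \(p_k = e^{\mu_k}/\sum_i e^{\mu_i}\).

```python
import math
import random

random.seed(0)
mus = [-1.0, 0.0, 1.5]  # location parameters; negative values are fine

def gumbel_sample(mu):
    # Inverse-CDF sampling; random.random() is in [0, 1) and is
    # essentially never exactly 0 in practice.
    u = random.random()
    return mu - math.log(-math.log(u))

def sample_by_argmax(mus):
    gs = [gumbel_sample(mu) for mu in mus]
    return max(range(len(gs)), key=gs.__getitem__)

# Softmax probabilities p_k = e^{mu_k} / sum_i e^{mu_i}
z = sum(math.exp(m) for m in mus)
p = [math.exp(m) / z for m in mus]

n = 60_000
counts = [0] * len(mus)
for _ in range(n):
    counts[sample_by_argmax(mus)] += 1
freqs = [c / n for c in counts]
# freqs should approach p
```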
<h2 id="when-is-gumbel-max-trick-useful">When is Gumbel-Max Trick Useful?</h2>
<p>It seems like a lot of work to sample from a multinomial by argmaxing over Gumbel samples (or argmining over Exponential samples). In what situation would we ever want to do this?</p>
<p>The short answer is that the Gumbel-Max trick allows us to make a sampling step <strong>differentiable</strong>. Specifically, it makes sampling from a multinomial distribution differentiable. We’ll take a closer look at this in a future post, but pause for a second and think about it. We are saying it is possible to differentiate through the action of drawing a discrete sample from a multinomial distribution! This was a pretty surprising possibility to me.</p>
<p>Regarding downstream applications, differentiating through sampling is an important “trick” in neural-network-based variational inference in general. Multinomial discrete random variables are prevalent in many learning problems. The Gumbel-Max trick allows us to work with them in many interesting neural variational inference problems, which we will look into in future posts.</p>Ran DingExponential-min and Gumbel-max tricks for sampling from a multinomial distribution by taking the argmin and argmax.Recent Progress in Language Modeling2018-10-09T00:00:00+00:002018-10-09T00:00:00+00:00https://dingran.github.io/LM\[\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}\]
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<h2 id="overview">Overview</h2>
<p>This page is a high-level summary of, and set of notes on, various recent results in language modeling, with little explanation. The papers covered are as follows:</p>
<p><strong>[1] AWD Language Model</strong></p>
<ul>
<li>Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. “Regularizing and optimizing LSTM language models.” arXiv preprint arXiv:1708.02182 (2017).</li>
</ul>
<p><strong>[2] Neural Cache</strong></p>
<ul>
<li>Grave, Edouard, Armand Joulin, and Nicolas Usunier. “Improving neural language models with a continuous cache.” arXiv preprint arXiv:1612.04426 (2016).</li>
</ul>
<p><strong>[3] Dynamic Evaluation</strong></p>
<ul>
<li>Krause, Ben, et al. “Dynamic evaluation of neural sequence models.” arXiv preprint arXiv:1709.07432 (2017).</li>
</ul>
<p><strong>[4] Memory-based Parameter Adaptation (MbPA)</strong></p>
<ul>
<li>Sprechmann, Pablo, et al. “Memory-based parameter adaptation.” arXiv preprint arXiv:1802.10542 (2018).</li>
</ul>
<p><strong>[5] Hebbian Softmax</strong></p>
<ul>
<li>Rae, Jack W., et al. “Fast Parametric Learning with Activation Memorization.” arXiv preprint arXiv:1803.10049 (2018).</li>
</ul>
<p><strong>[6] Higher-rank LM / Mixture-of-Softmax (MoS)</strong></p>
<ul>
<li>Yang, Zhilin, et al. “Breaking the softmax bottleneck: A high-rank RNN language model.” arXiv preprint arXiv:1711.03953 (2017).</li>
</ul>
<p>This is by no means an exhaustive literature review - they are only a selection of a few of the most recent state-of-the-art results. The AWD LM [1] has almost become the de-facto baseline LM for many of the other papers; its main innovations are a special version of <strong>A</strong>veraged SGD (ASGD) along with DropConnect-based <strong>W</strong>eight <strong>D</strong>ropping regularization in the hidden-to-hidden mapping of an LSTM model.</p>
<p>It has been found that a global LM is ineffective at reacting to local patterns at test time; for example, once a rare word appears, its reappearance in close proximity is much more likely than a global LM would predict. To allow for faster reaction to local patterns, [2-5] propose various schemes involving a fast-learning non-parametric component whose predictions or parameters are blended with the globally learned parametric LM. A quick comparison of these four papers is in the table below.</p>
<table>
<thead>
<tr>
<th>Ref</th>
<th>Method</th>
<th>Modifications to training?</th>
<th>Adaptation needed at test time?</th>
</tr>
</thead>
<tbody>
<tr>
<td>[2]</td>
<td>Keeps a key-value store whose keys are previous (fixed-size) output hidden states and whose values are the correct labels. This non-parametric cache provides a local LM based on nearest-neighbor lookup, which is then interpolated with the global LM for the final prediction.</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>[3]</td>
<td>Similar to [2], but instead of doing a nearest-neighbor lookup over saved hidden states, here we fit the recent history with gradient descent, producing a slightly adjusted model; i.e. the parameters, not just the predictions, are adapted to recent history. One concern I would have is whether the continuous adaptation would let the model drift too far from the initially trained model.</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>[4]</td>
<td>Similar to [3], but the test-time gradient descent produces a local model that is discarded after being used for prediction; i.e. unlike [3], the change of parameters due to local memory does not carry over to the next time step. Thus this is quite closely related to meta-learning. Another minor point: the gradient descent does not go through the full network, but stops at the so-called embedding layer, which is usually a layer close to the output, extracting fairly abstract features.</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr>
<td>[5]</td>
<td>Recent output hidden states are accumulated into one vector using an exponential moving average, which is then used to directly update the output linear-mapping parameter matrix. Two sets of update rules are used at training time, and the non-parametric learning is tapered off as words are seen more frequently. Unlike [2-4], this method incorporates fast learning at training time, not just fast adaptation at test time.</td>
<td>Yes</td>
<td>No</td>
</tr>
</tbody>
</table>
<p>Table 1. Comparison of methods in Ref [2-5]</p>
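To make the cache row of the table concrete, here is a minimal sketch of the mechanism in [2], with made-up names and a toy example. The similarity kernel (a scaled dot product with flatness parameter <code>theta</code>) and the interpolation weight are assumptions for illustration; the real model uses the LSTM's hidden states and tuned hyperparameters.

```python
import math

def cache_probs(h_t, cache, vocab_size, theta=0.3):
    """Non-parametric cache distribution (a sketch of the neural cache).
    cache: list of (h_past, word_id) pairs; vectors are plain lists.
    Scores past states by exp(theta * <h_t, h_past>) and accumulates
    the normalized scores onto the stored labels."""
    scores = [math.exp(theta * sum(a * b for a, b in zip(h_t, h)))
              for h, _ in cache]
    z = sum(scores)
    p = [0.0] * vocab_size
    for (_, w), s in zip(cache, scores):
        p[w] += s / z
    return p

def interpolate(p_model, p_cache, lam=0.1):
    # Final prediction: (1 - lam) * global LM + lam * cache LM.
    return [(1 - lam) * pm + lam * pc for pm, pc in zip(p_model, p_cache)]

# Tiny example: vocab of 3 words, cache holding two past states.
cache = [([1.0, 0.0], 0), ([0.0, 1.0], 2)]
p_model = [1 / 3] * 3  # uniform global LM for illustration
p = interpolate(p_model, cache_probs([1.0, 0.0], cache, 3))
# p stays a valid distribution, tilted toward word 0 (the nearest neighbor).
```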
<p>And finally, [6] highlights, and mostly solves, a fairly general problem: the softmax over products of rank-limited matrices, which is common in the decoder of an LM.</p>
<hr />
<h2 id="awd-lm">AWD LM</h2>
<h2 id="neural-cache">Neural Cache</h2>
<h2 id="dynamic-evaluation">Dynamic Evaluation</h2>
<h2 id="memory-based-parameter-adaptation-mbpa">Memory-based Parameter Adaptation (MbPA)</h2>
<h2 id="hebbian-softmax">Hebbian Softmax</h2>
<h2 id="higher-rank-lm--mixture-of-softmax-mos">Higher-rank LM / Mixture-of-Softmax (MoS)</h2>
<iframe src="https://drive.google.com/file/d/1nMS1FnJ8xQPcZ06JokXDU4f5Z8O_riig/preview?usp=sharing" width="100%" height="600"></iframe>Ran DingA brief overview of various techniques in recent language model (LM) literatures including AWD LM, the use of cache, dynamic evaluation, other memory-based non-parametric components to enhance learned parametric LM, and finally, recent progress in high-rank LM.Quantitative Interview Preparation Guide2018-05-05T00:00:00+00:002018-05-05T00:00:00+00:00https://dingran.github.io/PP<h2 id="what-is-this">What is this</h2>
<p>A short list of resources and topics covering the essential quantitative tools for Data Scientists, Machine Learning Engineers/Scientists, Quant Developers/Researchers and those who are preparing to interview for these roles.</p>
<p>At a high-level we can divide things into 3 main areas:</p>
<ol>
<li>Machine Learning</li>
<li>Coding</li>
<li>Math (calculus, linear algebra, probability, etc)</li>
</ol>
<p>Depending on the type of role, the emphasis can be quite different. For example, AI/ML interviews might go deeper into the latest deep learning models, while quant interviews might cast a wide net over various kinds of math puzzles. Interviews for research-oriented roles might be lighter on coding problems, or at least emphasize algorithms instead of software design or tooling.</p>
<h2 id="list-of-resources">List of resources</h2>
<p>A minimalist list of the best/most practical ones:</p>
<p><img src="https://dingran.github.io/assets/images/PP/cs229.png" alt="" />
<img src="https://dingran.github.io/assets/images/PP/mit6006.jpg" alt="" />
<img src="https://dingran.github.io/assets/images/PP/stats110.jpg" alt="" /></p>
<p>Machine Learning:</p>
<ul>
<li>Course on classic ML: Andrew Ng’s CS229 (there are several different versions, <a href="https://www.coursera.org/learn/machine-learning">the Cousera one</a> is easily accessible. There is also an <a href="https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599">older version</a> recorded at Stanford)</li>
<li>Book on classic ML: Alpaydin’s Intro to ML <a href="https://www.amazon.com/Introduction-Machine-Learning-Adaptive-Computation/dp/026201243X/ref=la_B001KD8D4G_1_2?s=books&ie=UTF8&qid=1525554938&sr=1-2">link</a></li>
<li>Course with a deep learning focus: <a href="http://cs231n.stanford.edu/">CS231n</a> from Stanford, lectures available on YouTube.</li>
</ul>
<blockquote>
<p>If you are just breaking into the field I think the above are enough, stop there and move on to other areas of preparation. Here are a few very optional items, mostly on deep learning, in case you have more time:</p>
<ul>
<li>Overview book on deep learning: <a href="http://neuralnetworksanddeeplearning.com/">Neural Networks and Deep Learning</a> by Michael Nielson.</li>
<li>Amazing book on deep learning for NLP: <a href="https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies-ebook/dp/B071FGKZMH">Neural Network Methods for Natural Language Processing</a> by Yoav Goldberg</li>
<li>Pick one of those Udacity nanodegrees on deep learning / self-driving cars to get some hands-on experience with deep learning frameworks (TensorFlow, PyTorch, MXNet)</li>
</ul>
</blockquote>
<p>Coding:</p>
<ul>
<li>Course: MIT OCW 6006 <a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/">link</a></li>
<li>Book: Cracking the Coding Interview <a href="https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/098478280X">link</a></li>
<li>Practice sites: <a href="https://leetcode.com/">Leetcode</a>, <a href="https://www.hackerrank.com/">HackerRank</a></li>
<li>SQL tutorial: from <a href="https://community.modeanalytics.com/sql/">Mode Analytics</a></li>
</ul>
<p>Math:</p>
<ul>
<li>Calculus and Linear Algebra: undergrad class would be the best, refresher notes from CS229 <a href="http://cs229.stanford.edu/section/cs229-linalg.pdf">link</a></li>
<li>Probability: Harvard Stats110 <a href="https://projects.iq.harvard.edu/stat110/home">link</a>; <a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1466575573/ref=pd_lpo_sbs_14_t_2?_encoding=UTF8&psc=1&refRID=5W11QQ7WW4DFE0Q89N7V">book</a> from the same professor.</li>
<li>Statistics: Shaum’s Outline <a href="https://www.amazon.com/Schaums-Outline-Statistics-5th-Outlines/dp/0071822526">link</a>.</li>
<li>[Optional] Numerical Methods and Optimization: these are two different topics really, college courses are probably the best bet. I have yet to find good online courses for them. But don’t worry, most interviews won’t really touch on them.</li>
</ul>
<h2 id="list-of-topics">List of topics</h2>
<p>Here is a list of topics from which interview questions are often derived. The depth and trickiness of the questions certainly depend on the role and the company.</p>
<p>Under each topic, I try to add a few bullet points of the key things you should know.</p>
<h3 id="machine-learning">Machine learning</h3>
<ul>
<li>Models (roughly in decreasing order of frequency)
<ul>
<li>Linear regression
<ul>
<li>e.g. assumptions, multicollinearity, derive from scratch in linear algebra form</li>
</ul>
</li>
<li>Logistic regression
<ul>
<li>be able to write out everything from scratch: from defining a classification problem to the gradient updates</li>
</ul>
</li>
<li>Decision trees/forests
<ul>
<li>e.g. how does a tree/forest grow, on a pseudocode level</li>
</ul>
</li>
<li>Clustering algorithms
<ul>
<li>e.g. K-means, agglomerative clustering</li>
</ul>
</li>
<li>SVM
<ul>
<li>e.g. margin-based loss objectives, how do we use support vectors, primal-dual problem</li>
</ul>
</li>
<li>Generative vs discriminative models
<ul>
<li>e.g. Gaussian mixture, Naive Bayes</li>
</ul>
</li>
<li>Anomaly/outlier detection algorithms (DBSCAN, LOF etc)</li>
<li>Matrix factorization based models</li>
</ul>
</li>
<li>Training methods
<ul>
<li>Gradient descent, SGD and other popular variants
<ul>
<li>Understand momentum, how these methods work, and what the differences are between the popular ones (RMSProp, Adagrad, Adadelta, Adam etc)</li>
<li>Bonus point: when to not use momentum?</li>
</ul>
</li>
<li>EM algorithm
<ul>
<li>Andrew’s <a href="http://cs229.stanford.edu/notes/cs229-notes8.pdf">lecture notes</a> are great, also see <a href="https://dingran.github.io/EM/">this</a></li>
</ul>
</li>
<li>Gradient boosting</li>
</ul>
</li>
<li>Learning theory / best practice (see Andrew’s advice <a href="http://cs229.stanford.edu/materials/ML-advice.pdf">slides</a>)
<ul>
<li>Bias vs variance, regularization</li>
<li>Feature selection</li>
<li>Model validation</li>
<li>Model metrics</li>
<li>Ensemble methods, boosting, bagging, bootstrapping</li>
</ul>
</li>
<li>Generic topics on deep learning
<ul>
<li>Feedforward networks</li>
<li>Backpropagation and computation graph
<ul>
<li>I really liked the <a href="https://gist.github.com/dingran/154a524003c86ecab4a949c538afa766">miniflow</a> project Udacity developed</li>
<li>In addition, be absolutely familiar with doing derivatives with matrix and vectors, see <a href="http://cs231n.stanford.edu/vecDerivs.pdf">Vector, Matrix, and Tensor Derivatives</a> by Erik Learned-Miller and <a href="http://cs231n.stanford.edu/handouts/linear-backprop.pdf">Backpropagation for a Linear Layer</a> by Justin Johnson</li>
</ul>
</li>
<li>CNN, RNN/LSTM/GRU</li>
<li>Regularization in NN, dropout, batch normalization</li>
</ul>
</li>
</ul>
<h3 id="coding-essentials">Coding essentials</h3>
<p>There are a lot of resources online on how to prepare for this; some are already listed in the resources section. I think the key thing is to pick a language and know it really well. For example, for Python, if you claim you know it well and have used it in some non-trivial code base, I’ll assume you know why <code class="language-plaintext highlighter-rouge">abstract base class</code> exists, how decorators work in general and what <code class="language-plaintext highlighter-rouge">@property</code> means, along with a few language-specific data structures (like <code class="language-plaintext highlighter-rouge">OrderedDict</code>, <code class="language-plaintext highlighter-rouge">deque</code> and <code class="language-plaintext highlighter-rouge">defaultdict</code>).</p>
<p>The bare minimum of coding concepts you need to know well.</p>
<ul>
<li>Data structures:
<ul>
<li>array, dict, linked list, tree, heap, graph, ways of representing sparse matrices</li>
</ul>
</li>
<li>Sorting algorithms:
<ul>
<li>see <a href="https://brilliant.org/wiki/sorting-algorithms/">this</a> from brilliant.org</li>
<li>I think in real life you’ll most likely never implement a sorting algorithm, but I think <code class="language-plaintext highlighter-rouge">quick sort</code> is very cool and <code class="language-plaintext highlighter-rouge">quick select</code> / <code class="language-plaintext highlighter-rouge">partition</code> is used in many other places, so take a look.</li>
</ul>
</li>
<li>Tree/Graph related algorithms
<ul>
<li>Traversal (BFS, DFS)</li>
<li>Shortest path (two-sided BFS, Dijkstra)</li>
</ul>
</li>
<li>Trees related
<ul>
<li>TBA</li>
</ul>
</li>
<li>Heap related
<ul>
<li>TBA</li>
</ul>
</li>
<li>Recursion and dynamic programming
<ul>
<li>TBA</li>
</ul>
</li>
</ul>
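Since <code class="language-plaintext highlighter-rouge">quick select</code> / <code class="language-plaintext highlighter-rouge">partition</code> comes up repeatedly, here is a compact reference sketch using the standard Lomuto partition scheme (one of several partition variants; the function names are my own):

```python
def partition(a, lo, hi):
    """Lomuto partition: move a[hi] (the pivot) to its sorted position
    within a[lo:hi+1] and return that index; smaller elements end up
    to its left, larger-or-equal ones to its right."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quickselect(items, k):
    """Return the k-th smallest element (0-indexed) in expected O(n) time."""
    a = list(items)  # work on a copy; partition mutates in place
    lo, hi = 0, len(a) - 1
    while True:
        p = partition(a, lo, hi)
        if p == k:
            return a[p]
        if p < k:
            lo = p + 1
        else:
            hi = p - 1
```

Each partition call discards the half of the range that cannot contain index <code class="language-plaintext highlighter-rouge">k</code>, which is exactly why the same routine shows up in so many other problems (medians, top-k, order statistics).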
<h3 id="calculus">Calculus</h3>
<p>Just to spell things out</p>
<ul>
<li>Derivatives
<ul>
<li>Product rule, chain rule, power rule, L’Hospital’s rule</li>
<li>Partial and total derivative</li>
<li>Things worth remembering
<ul>
<li>common function’s derivatives</li>
<li>limits and approximations</li>
</ul>
</li>
<li>Applications of derivatives: e.g. <a href="https://math.stackexchange.com/questions/1619911/why-ex-is-always-greater-than-xe">this</a></li>
</ul>
</li>
<li>Integration
<ul>
<li>Power rule, integration by sub, integration by part</li>
<li>Change of coordinates</li>
</ul>
</li>
<li>Taylor expansion
<ul>
<li>Single and multiple variables</li>
<li>Taylor/McLauren series for common functions</li>
<li>Derive Newton-Raphson</li>
</ul>
</li>
<li>ODEs, PDEs (common ways to solve them analytically)</li>
</ul>
<h3 id="linear-algebra">Linear algebra</h3>
<ul>
<li>Vector and matrix multiplication</li>
<li>Matrix operations (transpose, determinant, inverse etc)</li>
<li>Types of matrices (symmetric, Hermitian, orthogonal etc) and their properties</li>
<li>Eigenvalue and eigenvectors</li>
<li>Matrix calculus (gradients, Hessians etc)</li>
<li>Useful theorems</li>
<li>Matrix decomposition</li>
<li>Concrete applications in ML and optimization</li>
</ul>
<h3 id="probability">Probability</h3>
<p>Solving probability interview questions is really all about pattern recognition. To do well, do plenty of exercises from <a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1466575573/ref=pd_lpo_sbs_14_t_2?_encoding=UTF8&psc=1&refRID=5W11QQ7WW4DFE0Q89N7V">this</a> and <a href="https://www.amazon.com/Practical-Guide-Quantitative-Finance-Interviews/dp/1438236662">this</a>. This topic is particularly heavy in quant interviews and usually quite light in ML/AI/DS interviews.</p>
<ul>
<li>Basic concepts
<ul>
<li>Event, outcome, random variable, probability and probability distributions</li>
</ul>
</li>
<li>Combinatorics
<ul>
<li>Permutation</li>
<li>Combinations</li>
<li>Inclusion-exclusion</li>
</ul>
</li>
<li>Conditional probability
<ul>
<li>Bayes rule</li>
<li>Law of total probability</li>
</ul>
</li>
<li>Probability Distributions
<ul>
<li>Expectation and variance equations</li>
<li>Discrete probability and stories</li>
<li>Continuous probability: uniform, Gaussian, Poisson</li>
</ul>
</li>
<li>Expectations, variance, and covariance
<ul>
<li>Linearity of expectation
<ul>
<li>solving problems with this theorem and symmetry</li>
</ul>
</li>
<li>Law of total expectation</li>
<li>Covariance and correlation</li>
<li>Independence implies zero correlation</li>
<li>Hash collision probability</li>
</ul>
</li>
<li>Universality of Uniform distribution
<ul>
<li>Proof</li>
<li>Circle problem</li>
</ul>
</li>
<li>Order statistics
<ul>
<li>Expectation of the min and max of random variables</li>
</ul>
</li>
<li>Graph-based solutions involving multiple random variables
<ul>
<li>e.g. breaking sticks, meeting at the train station, frog jump (simplex)</li>
</ul>
</li>
<li>Approximation method: Central Limit Theorem
<ul>
<li>Definition, examples (unfair coins, Monte Carlo integration)</li>
<li><a href="https://github.com/dingran/quant-notes/blob/master/prob/central_limit_theorem.ipynb">Example question</a></li>
</ul>
</li>
<li>Approximation method: Poisson Paradigm
<ul>
<li>Definition, examples (duplicated draw, near birthday problem)</li>
</ul>
</li>
<li>Poisson count/time duality
<ul>
<li>Poisson from Poissons</li>
</ul>
</li>
<li>Markov chain tricks
<ul>
<li>Various games, introduction of martingale</li>
</ul>
</li>
</ul>
<h3 id="statistics">Statistics</h3>
<ul>
<li>Z-score, p-value</li>
<li>t-test, F-test, Chi2 test (know when to use which)</li>
<li>Sampling methods</li>
<li>AIC, BIC</li>
</ul>
<h3 id="optional-numerical-methods-and-optimization">[Optional] Numerical methods and optimization</h3>
<ul>
<li>Computer errors (e.g. float)</li>
<li>Basic root finding (newton method, bisection, secant etc)</li>
<li>Interpolating</li>
<li>Numerical integration and difference</li>
<li>Numerical linear algebra
<ul>
<li>Solving linear equations, direct methods (understand complexities here) and iterative methods (e.g. conjugate gradient), maybe BFGS</li>
<li>Matrix decompositions/transformations (e.g. QR, Givens, LU, SVD etc)</li>
<li>Eigenvalue solvers (e.g. power iteration, Arnoldi/Lanczos etc)</li>
</ul>
</li>
<li>ODE solvers (explicit, implicit)</li>
<li>Finite-difference method, finite-element method</li>
<li>Optimization topics: linear programming (and convex opt in general), calculus of variations</li>
</ul>Ran DingA short list of resources and topics covering the essential quantitative tools for Data Scientists, Machine Learning Engineers/Scientists, Quant Developers/Researchers and those who are preparing to interview for these roles.Recent Progress in Neural Variational Inference2018-03-08T00:00:00+00:002018-03-08T00:00:00+00:00https://dingran.github.io/NVI<iframe src="https://drive.google.com/file/d/1PbUU94Cf6EsR9AxND_WbREHq3q3jrt1c/preview?usp=sharing" width="100%" height="600"></iframe>Ran DingA literature survey of recent papers on Neural Variational Inference (NVI) and its application in topic modeling.Brief Survey of Generative Models2017-12-20T00:00:00+00:002017-12-20T00:00:00+00:00https://dingran.github.io/GM\[\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}\]
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<h2 id="overview">Overview</h2>
<p>This page is a high-level summary of various generative models with little explanations. Models to cover are as follows:</p>
<p><strong>Variational Autoencoders (VAE)</strong></p>
<ul>
<li>Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).</li>
</ul>
<p><strong>Adversarial Variational Bayes (AVB)</strong></p>
<p>Extension to VAE to use non-Gaussian encoders</p>
<ul>
<li>Mescheder, Lars, Sebastian Nowozin, and Andreas Geiger. “Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks.” arXiv preprint arXiv:1701.04722 (2017).</li>
</ul>
<p><strong>Generative Adversarial Networks (GAN)</strong></p>
<ul>
<li>Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.</li>
</ul>
<p><strong>Generalized divergence minimization GAN (\(f\)-GAN)</strong></p>
<ul>
<li>Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. “f-gan: Training generative neural samplers using variational divergence minimization.” Advances in Neural Information Processing Systems. 2016.</li>
</ul>
<p><strong>Wasserstein GAN (WGAN)</strong></p>
<ul>
<li>Arjovsky, Martin, Soumith Chintala, and Léon Bottou. “Wasserstein gan.” arXiv preprint arXiv:1701.07875 (2017).</li>
</ul>
<p><strong>Adversarial Autoencoders (AAE)</strong></p>
<ul>
<li>Makhzani, Alireza, et al. “Adversarial autoencoders.” arXiv preprint arXiv:1511.05644 (2015).</li>
</ul>
<p><strong>Wasserstein Auto-Encoder (WAE)</strong></p>
<ul>
<li>Tolstikhin, Ilya, et al. “Wasserstein Auto-Encoders.” arXiv preprint arXiv:1711.01558 (2017).</li>
</ul>
<p><strong>Cramer GAN</strong></p>
<ul>
<li>Bellemare, Marc G., et al. “The Cramer Distance as a Solution to Biased Wasserstein Gradients.” arXiv preprint arXiv:1705.10743 (2017).</li>
</ul>
<hr />
<h2 id="vae">VAE</h2>
<h3 id="model-setup">Model setup:</h3>
<ul>
<li>Recognition model: \(q_\phi(z \vert x) = \mathcal N(\mu=h_1(x), \sigma^2 \mathbf I=h_2(x)\mathbf I)\)</li>
<li>Assumed fixed prior: \(p(z) = \mathcal N(0,\mathbf I)\)</li>
<li>Generation model: \(p_\theta(x \vert z) = \mathcal N(\mu=g_1(z), \sigma^2 \mathbf I=g_2(z)\mathbf I)\)
<ul>
<li>Implied (but intractable) posterior: \(p_\theta(z\vert x)\)</li>
</ul>
</li>
</ul>
<h3 id="key-equations">Key equations:</h3>
\[\begin{equation}
\begin{split}
\log p_\theta (x^i) = D_{KL}(q_\phi(z\vert x^i)\| p_\theta(z\vert x^i)) + \mathcal L(\theta, \phi, x^i)
\end{split}
\end{equation}\]
\[\begin{equation}
\begin{split}
\mathcal L(\theta, \phi, x^i)
&= \mathbb{E}_{z\sim q_\phi(z\vert x^i)}[\log p_\theta(x^i,z) - \log q_\phi(z\vert x^i)]\\\\
&= \mathbb{E}_{z\sim q_\phi(z\vert x^i)}[\log p_\theta(x^i,z)] + H[q_\phi(z\vert x^i)]\\\\
&= \mathbb{E}_{z\sim q_\phi(z\vert x^i)}[\log p_\theta(x^i \vert z)] - D_{KL}[q_\phi(z\vert x^i) \| p(z)]\\\\
\end{split} \label{vae_elbo}
\end{equation}\]
<h3 id="optimization-objective">Optimization objective:</h3>
\[\hat\theta, \hat\phi = \argmax_{\theta, \phi} \sum_i \mathcal L(\theta, \phi, x^i)\]
<h3 id="gradient-friendly-monte-carlo">Gradient-friendly Monte Carlo:</h3>
<p>Difficulties in calculating \(\mathcal L(\theta, \phi, x^i)\):</p>
<ul>
<li>Due to the generality of \(q\) and \(p\) (typically a neural network), the expectation in \(\ref{vae_elbo}\) does not have an analytical form. So we need to resort to Monte Carlo estimation.</li>
<li>Furthermore, direct sampling \(z\) according to \(q\) poses difficulty in taking derivative against parameters \(\phi\) that parameterizes the distribution \(q\).</li>
</ul>
<p>Solution: Reparameterization Trick</p>
<p>Find smooth and invertible transformation \(z=g_\phi(\epsilon)\) such that with \(\epsilon\) drawn from a <em>fixed</em> (non-parameterized) distribution \(p(\epsilon)\) we have \(z \sim q(z; \phi)\), so</p>
\[\mathbb{E}_{z\sim q(z;\phi)}[f(z)] = \mathbb{E}_{\epsilon\sim p(\epsilon)}[f(g_\phi(\epsilon))]\]
<p>For the Normal distribution used here (\(q_\phi(z\vert x)\)), it is convenient to use location-scale transformation, \(z=\mu+\sigma * \epsilon\) with \(\epsilon \sim \mathcal N(0,\mathbf I)\).</p>
\[\begin{equation}
\widetilde{\mathcal{L}}(\theta, \phi, x^i) = \frac{1}{L} \sum_{l=1}^L \log p_\theta(x^i \vert z^{i,l}) - D_{KL}[q_\phi(z\vert x^i) \| p(z)]
\end{equation}\]
\[z^{i,l} = \mu_{x^i} + \sigma_{x^i} * \epsilon^{i,l} ~~\text{and}~~ \epsilon^{i,l} \sim \mathcal N(0,\mathbf I)\]
<p>For total \(N\) data points with mini batch size \(M\):</p>
\[\begin{equation}
\begin{split}
{\mathcal L}(\theta, \phi; X) = \sum_{i=1}^N \mathcal L(\theta, \phi, x^i) \approx \widetilde {\mathcal L^M}(\theta, \phi; X) = \frac{N}{M} \sum_{i=1}^M \widetilde {\mathcal L}(\theta, \phi, x^i)
\end{split}
\end{equation}\]
<p>For a sufficiently large mini-batch size \(M\), the inner-loop sample size \(L\) can be set to 1. Because both the mini-batch gradient descent and the expectation estimation are stochastic, this is also called <em>doubly stochastic estimation</em>.</p>
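<p>A minimal numpy sketch of the estimator above (the values of <code>mu</code> and <code>log_var</code> are illustrative stand-ins for the encoder outputs \(h_1(x^i)\), \(h_2(x^i)\)):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend encoder outputs for one data point x^i (illustrative values):
mu = np.array([0.2, -1.0])
log_var = np.array([-0.5, 0.3])
sigma = np.exp(0.5 * log_var)

# Reparameterized draws z^{i,l} = mu + sigma * eps with eps ~ N(0, I);
# the randomness lives in eps, so gradients can flow through mu and sigma.
L = 5
eps = rng.standard_normal((L, 2))
z = mu + sigma * eps

# For a diagonal Gaussian vs N(0, I) the KL term is available in closed form:
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(z.shape, round(kl, 4))
```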
<h3 id="using-non-gaussian-encoders">Using non-Gaussian encoders</h3>
<blockquote>
<p>Todo: discuss AVB paper</p>
</blockquote>
<h3 id="gumble-trick-for-discrete-latent-variables">Gumbel trick for discrete latent variables</h3>
<p>Ref for this section:</p>
<ol>
<li>Gumbel-max trick <a href="https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/">https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/</a></li>
<li>Balog, Matej, et al. “Lost Relatives of the Gumbel Trick.” arXiv preprint arXiv:1706.04161 (2017).</li>
<li>Jang, Eric, Shixiang Gu, and Ben Poole. “Categorical reparameterization with gumbel-softmax.” arXiv preprint arXiv:1611.01144 (2016).</li>
</ol>
<p>Gumbel distribution:</p>
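<p>The Gumbel-max trick from reference 1 can be checked empirically: adding i.i.d. \(\text{Gumbel}(0,1)\) noise to the log-probabilities and taking the \(\argmax\) yields exact samples from the categorical distribution. A sketch with illustrative probabilities:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])          # target categorical distribution
logits = np.log(p)

n = 100_000
# G = -log(-log U) with U ~ Uniform(0, 1) is Gumbel(0, 1) distributed
gumbel = -np.log(-np.log(rng.random((n, 3))))
samples = np.argmax(logits + gumbel, axis=1)

freq = np.bincount(samples, minlength=3) / n
print(freq)                             # empirical frequencies, close to p
```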
<hr />
<h2 id="f-gan-and-gan">\(f\)-GAN and GAN</h2>
<h3 id="prelude-on-f-divergence-and-its-variational-lower-bound">Prelude on \(f\)-divergence and its variational lower bound</h3>
<p>The f-divergence family</p>
\[\begin{equation}
D_f(P \| Q) = \int_{\mathcal X} q(x) ~ f\left( \frac{p(x)}{q(x)}\right) dx
\label{f_div}
\end{equation}\]
<p>where the <em>generator function</em> \(f: \mathbb{R}_{+} \rightarrow \mathbb{R}\) is a convex, lower-semicontinuous function satisfying \(f(1) = 0\).</p>
<p>Every convex, lower-semicontinuous function has a <em>convex conjugate</em> function \(f^c\), also known as <em>Fenchel conjugate</em>. This function is defined as</p>
\[\begin{equation}
f^c(t) = \underset {u \in \text{dom}_f}{\text{sup}} \{ut - f(u)\}
\end{equation}\]
<p>Function \(f^c\) is again convex and lower-semicontinuous and the pair \((f,f^c)\) is dual to each other, i.e. \(\left(f^{c}\right)^c=f\). So we can represent \(f\) as</p>
\[\begin{equation}
f(t) = \underset {u \in \text{dom}_{f^c}}{\text{sup}} \{tu - f^c(u)\}
\end{equation}\]
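<p>A quick numerical sanity check of this biconjugation identity, sketched for the forward-KL generator \(f(u)=u\log u\) with conjugate \(f^c(t)=e^{t-1}\) (a grid search stands in for the \(\text{sup}\)):</p>

```python
import numpy as np

t = np.linspace(-5.0, 5.0, 100_001)     # grid over dom(f^c) = R

for u in (0.5, 1.0, 2.0):
    # f(u) should be recovered as sup_t { t*u - f^c(t) }
    recovered = np.max(u * t - np.exp(t - 1.0))
    exact = u * np.log(u)
    print(round(recovered, 4), round(exact, 4))
```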
<p>With this we can establish a lower bound for estimating the f-divergence in general</p>
\[\begin{equation}
\begin{split}
D_f(P \| Q) &= \int_{\mathcal X} q(x) \underset {t \in \text{dom}_{f^c}}{\text{sup}} \{t \frac{p(x)}{q(x)} - f^c(t)\} dx \\\\
& \ge \underset {T \in {\mathcal T}} {\text{sup}} \int_{\mathcal X} \left( p(x)T(x) - q(x)f^c(T(x)) \right) dx \\\\
& = \underset {T \in {\mathcal T}} {\text{sup}} \left( \mathbb{E}_{x\sim P}[T(x)] - \mathbb{E}_{x\sim Q}[f^c(T(x))] \right)
\end{split} \label{f_lowerbound}
\end{equation}\]
<p>where \(\mathcal T\) is an arbitrary class of functions \(T: \mathcal X \rightarrow \mathbb R\). The inequality is due to Jensen’s inequality and the constraints imposed by \(\mathcal T\).</p>
<p>The bound is tight for</p>
\[\begin{equation}
T^*(x) = f' \left(\frac{p(x)}{q(x)} \right)
\end{equation}\]
<h3 id="generative-adversarial-training">Generative adversarial training</h3>
<p>Suppose our goal is to come up with a distribution \(Q\) (model) that is close to \(P\) (the data distribution) and the similarity score (loss) is measured by \(D_f(P \| Q)\). However the direct calculation of \(\ref{f_div}\) is intractable, such as the case where the functional form of \(P\) is unknown and \(Q\) is a complex model parameterized by a neural network.</p>
<p>To be specific:</p>
<ul>
<li>Evaluating \(q(x)\) at any \(x\) is easy, but integrating it is hard due to lack of easy functional form.</li>
<li>For \(p(x)\), we do not know how to evaluate it at any \(x\)</li>
<li>Sampling from both \(P\) and \(Q\) is easy: drawing from the data set approximates \(x \sim P\), and we can make the model \(Q\) take random vectors as input, which are easy to produce.</li>
</ul>
<p>Since sampling from both is easy, \(\ref{f_lowerbound}\) offers a way to estimate a lower bound of the divergence: maximize the bound over \(T\) so that it approaches the true divergence, then minimize it over \(Q\). This is formally stated as follows.</p>
\[\begin{equation}
F(\theta, \omega) = \mathbb{E}_{x\sim P}[T_\omega(x)] + \mathbb{E}_{x\sim Q_\theta}[-f^c(T_\omega(x))]
\end{equation}\]
\[\begin{equation}
\hat \theta = \argmin_\theta \max_\omega F(\theta, \omega)
\end{equation}\]
<p>To ensure that the output of \(T_\omega\) respects the domain of \({f^c}\), we define \(T_\omega(x) = g_f(V_\omega(x))\), where \(V_\omega: \mathcal X \rightarrow \mathbb R\) without any range constraints on the output and \(g_f: \mathbb R \rightarrow \text{dom}_{f^c}\) is an output activation function specific to the \(f\)-divergence used with suitable output ranges.</p>
<h3 id="gan">GAN</h3>
<p>For the original GAN, with a divergence target similar to Jensen-Shannon
\(\begin{equation}
F(\theta, \omega) = \mathbb{E}_{x\sim P}[\log D_\omega(x)] + \mathbb{E}_{x\sim Q_\theta}[\log(1-D_\omega(x))]
\end{equation}\)
with \(D_\omega(x) = 1/(1+e^{-V_\omega(x)})\)
which corresponds to the following</p>
<p>\(g_f(\nu)= \log(1/(1+e^{-\nu}))\)
\(T_\omega(x) = \log (D_\omega(x)) = g_f(V_\omega(x))\)
\(f^c(t) = -\log (1-\exp(t))\)
\(\log (1-D_\omega(x)) = -f^c(T_\omega(x))\)</p>
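<p>This correspondence can be checked numerically; a small sketch (the grid <code>v</code> of pre-activations is illustrative):</p>

```python
import numpy as np

v = np.linspace(-4.0, 4.0, 9)          # arbitrary pre-activation values
D = 1.0 / (1.0 + np.exp(-v))           # D_omega = sigmoid(V_omega)

T = np.log(D)                          # T = g_f(v) = log sigmoid(v)
fc_T = -np.log(1.0 - np.exp(T))        # f^c(t) = -log(1 - exp(t))

# The second GAN term log(1 - D) equals -f^c(T):
print(np.allclose(np.log(1.0 - D), -fc_T))   # True
```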
<h3 id="practical-considerations-in-adversarial-training">Practical considerations in adversarial training</h3>
<blockquote>
<p>Todo: log trick, DCGAN heuristics</p>
</blockquote>
<h3 id="example-divergence-and-their-related-functions">Example divergence and their related functions</h3>
<table>
<thead>
<tr>
<th>Name</th>
<th>\(D_f(P\| Q)\)</th>
<th>Generator \(f(u)\)</th>
<th>\(T^*(x)\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Forward KL</td>
<td>\(\int p(x) \log \frac{p(x)}{q(x)} dx\)</td>
<td>\(u\log u\)</td>
<td>\(1 +\log \frac{p(x)}{q(x)}\)</td>
</tr>
<tr>
<td>Reverse KL</td>
<td>\(\int q(x) \log \frac{q(x)}{p(x)} dx\)</td>
<td>\(-\log u\)</td>
<td>\(- \frac{q(x)}{p(x)}\)</td>
</tr>
<tr>
<td>Jensen-Shannon</td>
<td>\(\frac{1}{2} \int p(x) \log \frac{2p(x)}{p(x)+q(x)} + q(x) \log \frac{2q(x)}{p(x)+q(x)} dx\)</td>
<td>\(u\log u - (u+1) \log \frac{u+1}{2}\)</td>
<td>\(\log \frac{2p(x)}{p(x)+q(x)}\)</td>
</tr>
<tr>
<td>GAN</td>
<td>\(\int p(x) \log \frac{2p(x)}{p(x)+q(x)} + q(x) \log \frac{2q(x)}{p(x)+q(x)} dx -\log(4)\)</td>
<td>\(u\log u - (u+1) \log (u+1)\)</td>
<td>\(\log \frac{p(x)}{p(x)+q(x)}\)</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Name</th>
<th>Conjugate \(f^c(t)\)</th>
<th>\(\text{dom}_{f^c}\)</th>
<th>Output activation \(g_f\)</th>
<th>\(f'(1)\)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Forward KL</td>
<td>\(\exp(t-1)\)</td>
<td>\(\mathbb R\)</td>
<td>\(\nu\)</td>
<td>\(1\)</td>
</tr>
<tr>
<td>Reverse KL</td>
<td>\(-1-\log(-t)\)</td>
<td>\(\mathbb R_{-}\)</td>
<td>\(-\exp(\nu)\)</td>
<td>\(-1\)</td>
</tr>
<tr>
<td>Jensen-Shannon</td>
<td>\(-\log(2-\exp(t))\)</td>
<td>\(t < \log(2)\)</td>
<td>\(\log(2) - \log(1+\exp(-\nu))\)</td>
<td>\(0\)</td>
</tr>
<tr>
<td>GAN</td>
<td>\(-\log(1-\exp(t))\)</td>
<td>\(\mathbb R_{-}\)</td>
<td>\(- \log(1+\exp(-\nu))\)</td>
<td>\(-\log(2)\)</td>
</tr>
</tbody>
</table>
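<p>As a concrete check of the forward-KL row, here is a hedged Monte Carlo sketch of the bound \(\ref{f_lowerbound}\) with the optimal critic \(T^*(x) = f'(p(x)/q(x)) = 1 + \log(p(x)/q(x))\), for two unit-variance Gaussians (illustrative means):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu_p, mu_q = 0.0, 0.5                  # two unit-variance Gaussians P, Q

def log_ratio(x):                      # log p(x) - log q(x)
    return 0.5 * ((x - mu_q) ** 2 - (x - mu_p) ** 2)

xs_p = rng.normal(mu_p, 1.0, 200_000)  # x ~ P
xs_q = rng.normal(mu_q, 1.0, 200_000)  # x ~ Q

T_p = 1.0 + log_ratio(xs_p)            # T*(x) = 1 + log(p/q), evaluated on P
fc_q = np.exp(log_ratio(xs_q))         # f^c(T*(x)) = exp(T* - 1) = p/q, on Q

bound = T_p.mean() - fc_q.mean()
exact = 0.5 * (mu_p - mu_q) ** 2       # closed-form KL(P||Q), here 0.125
print(round(bound, 3), exact)          # the bound is tight at T*
```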
<hr />
<h2 id="wgan-and-wae">WGAN and WAE</h2>
<h3 id="optimal-transport-ot">Optimal transport (OT)</h3>
<p>Kantorovich formulated the optimization target in optimal transport problems as follows</p>
\[\begin{equation}
W_c(P_X, P_G) = \underset{\Gamma \in \mathcal P(x \sim P_X, y \sim P_G)}{\text{inf}} \mathbb{E}_{x,y \sim \Gamma}[c(x,y)]
\end{equation}\]
<p>where \(\mathcal P(X\sim P_X, Y\sim P_G)\) is the set of all joint distributions of \((X,Y)\) with marginals \(P_X\) and \(P_G\).</p>
<h3 id="wasserstein-distance">Wasserstein distance</h3>
<p>When \(c(x,y) = \| x-y \| ^p\) for \(p \ge 1\), \(W_c^{1/p}\) is called p-Wasserstein distance.</p>
\[\begin{equation}
W_p(P_X, P_G) = \left( \underset{\Gamma \in \mathcal P(x \sim P_X, y \sim P_G)}{\text{inf}} \mathbb{E}_{x,y \sim \Gamma}[\|x - y\|^p] \right)^{1/p}
\end{equation}\]
<p>The optimization problem is highly intractable in general, due to the constraint. However when \(p=1\), Kantorovich-Rubinstein duality holds:</p>
\[\begin{equation}
W_1(P_X, P_G) = \underset{f \in \text{\{1-Lipschitz\}}}{\text{sup}} \mathbb{E}_{x\sim P_X}[f(x)] - \mathbb{E}_{y\sim P_G}[f(y)]
\end{equation}\]
<p>Divergences from the \(f\)-divergence family only consider the relative probability (the ratio of the two density functions) and do not measure the closeness of the underlying outcomes. When the supports are disjoint, or overlap only on a set of measure zero, the divergence between a target distribution and a \(\theta\)-parameterized distribution may not be continuous with respect to \(\theta\). The Wasserstein distance, on the other hand, does take into account the underlying topology of the outcome space; it is continuous and differentiable in \(\theta\) almost everywhere, and thus almost always provides a useful gradient for optimization.</p>
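<p>In one dimension the primal \(W_1\) can be computed directly via quantile functions: for two equal-size samples the optimal coupling pairs sorted values. A sketch with illustrative Gaussians:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 10_000)      # samples from P_X
ys = rng.normal(1.5, 1.0, 10_000)      # samples from P_G

# 1-D optimal transport pairs the i-th order statistics:
w1 = np.abs(np.sort(xs) - np.sort(ys)).mean()
print(round(w1, 2))                    # translating a Gaussian by c costs c
```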
<h3 id="wasserstein-gan-wgan">Wasserstein GAN (WGAN)</h3>
<p>Following the dual form of \(W_1\), we can form a generative-adversarial model for a data distribution \(P_D\) and model \(Q_\theta\) with auxiliary function \(f\) that is 1-Lipschitz continuous.</p>
\[\begin{equation}
\hat \theta = \argmin_\theta \underset{f \in \text{\{1-Lipschitz\}}}{\text{sup}} \mathbb{E}_{x\sim P_D}[f(x)] - \mathbb{E}_{x\sim Q_\theta}[f(x)]
\end{equation}\]
<h3 id="practical-considerations-for-wgan">Practical considerations for WGAN</h3>
<blockquote>
<p>Todo: Gradient clipping with K-Lipschitz constraint on \(f\); Soft gradient penalty (WGAN-GP)</p>
</blockquote>
<h3 id="wasserstein-auto-encoder-wae">Wasserstein Auto-encoder (WAE)</h3>
<p>Rather than working with the dual form of the Wasserstein distance, which only holds for \(W_1\), we can also work with the primal form directly. As shown in <em>Tolstikhin, Ilya, et al. “Wasserstein Auto-Encoders.”</em>, the following holds when we have a deterministic decoder mapping the latent variable \(Z\) to \(Y\) through \(y=G(z)\):</p>
\[\begin{equation}
\begin{split}
W_c(P_X, P_G) = W_c^\dagger(P_X, P_G) &= \underset{P \in \mathcal P(x \sim P_X, z \sim P_Z)}{\text{inf}} \mathbb{E}_{x,z \sim P}[c(x, G(z))]\\\\
&= \underset{Q: Q_Z = P_Z}{\text{inf}} \mathbb{E}_{x\sim P_X} \mathbb{E}_{z \sim Q(Z\vert X)}[c(x, G(z))]
\end{split}
\end{equation}\]
<p>The constraint put on \(Q(Z\vert X)\) is that its marginal \(Q_Z\) needs to equal \(P_Z\). To obtain a feasible optimization problem, we relax this constraint into a constraint-free objective with a penalty that assesses the closeness between \(Q_Z\) and \(P_Z\) via any reasonable divergence. This new objective is named <em>penalized optimal transport</em> (POT).</p>
\[\begin{equation}
D_{POT/WAE}(P_X, P_G) := \underset{Q \in \mathcal Q}{\text{inf}} \mathbb{E}_{x\sim P_X} \mathbb{E}_{z \sim Q(Z\vert X)}[c(x, G(z))] + \lambda \cdot D_{Z} (Q_Z, P_Z)
\end{equation}\]
<p>If the divergence between \(P_Z\) and \(Q_Z\) is intractable to directly calculate, we could use generative-adversarial training to approximate it (see \(f\)-GAN).</p>
<blockquote>
<p>Note: if decoder is probabilistic instead of deterministic, we would only have \(W_c(P_X, P_G) \le W_c^\dagger(P_X, P_G)\), so we are minimizing an upper bound of the true OT cost.</p>
</blockquote>
<blockquote>
<p>Thought: the original paper used the JS divergence for \(D_Z\); what if we used the Wasserstein distance for \(D_Z\)?</p>
</blockquote>
<blockquote>
<p>Todo: discuss connections to AAE paper</p>
</blockquote>
<p>By Ran Ding. A high-level summary of various generative models including Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and their notable extensions and generalizations, such as f-GAN, Adversarial Variational Bayes (AVB), Wasserstein GAN, Wasserstein Auto-Encoder (WAE), Cramer GAN, etc.</p>
<hr />
<h1 id="em-algorithm-recap">EM Algorithm Recap (2017-12-15)</h1>
\[\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}
\renewcommand{\vec}[1]{\boldsymbol{#1}}\]
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<h2 id="introduction">Introduction</h2>
<p>This post explains Expectation-Maximization (EM) algorithm from scratch in a fairly concise fashion. The material is based on my own notes, which of course come from a variety of great resources online that are listed in the references section.</p>
<p>EM is one of the most elegant and widely used machine learning algorithms but is sometimes not thoroughly introduced in introductory machine learning courses. What is so elegant about EM is that, as we shall see, it originates from nothing but the most fundamental laws of probability.</p>
<p>Many variants of EM have been developed, and an important class of statistical machine learning methods called variational inference also has a strong connection to EM. The core ideas and derivatives of EM find many applications in both classical statistical machine learning and models that involve deep neural networks, making it worthwhile to have an intuitive and thorough understanding of it, which is what this post attempts to provide.</p>
<h2 id="notation">Notation</h2>
<!--- comment - Vector: $$\vec x$$; matrix $$\vec X$$. --->
<ul>
<li>Random variables \(X\), probability distribution \(P(X)\)</li>
<li>Probability density function (PDF) \(p(\cdot)\), evaluated at value \(x\): \(p(X=x)\) with \(p(x)\) as a shorthand</li>
<li>PDF with parameter \(\theta\) is noted as \(p_\theta(x)\) or equivalently \(p(x\vert \theta)\)</li>
<li>Expectation of \(f(x)\) according to distribution \(P\): \(\mathbb{E}_{x\sim P}\left[f(x)\right]\)</li>
<li>A set is noted as \({x_i}\) or calligraphic letter \(\mathcal X\)</li>
</ul>
<h2 id="maximum-likelihood">Maximum likelihood</h2>
<p>Suppose we have data coming from a distribution \(P_D(X)\), and we want to come up with a model for \(x\) parameterized by \(\theta\), written \(p(x;\theta)\) or equivalently \(p_{\theta}(x)\), that best approximates the real data distribution. Further assume all data samples are independent and identically distributed (iid) according to \(P_D(X)\).</p>
<p>To find \(\theta\) under a maximum likelihood scheme we do</p>
\[\begin{equation}
\begin{split}
\hat{\theta}_{MLE} &= \argmax_{\theta} \ell(\theta) \\\\
&= \argmax_{\theta} \sum_{i} \log\left( p_{\theta}(x_i) \right)
\end{split}
\end{equation}\]
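<p>For example, for a single-Gaussian model \(p_\theta = \mathcal N(\mu, \sigma^2)\) the maximizer has a closed form (a toy numpy sketch; the parameter values are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, 100_000)      # iid draws from the "data" distribution

# For p_theta = N(mu, sigma^2), the argmax of the log likelihood is the
# sample mean and the (biased) sample standard deviation:
mu_hat, sigma_hat = x.mean(), x.std()
print(round(mu_hat, 2), round(sigma_hat, 2))   # close to 2.0 and 3.0
```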
<h2 id="motivation-for-em">Motivation for EM</h2>
<p>We might encounter situations where, in addition to observed data \({x_i}\), we have missing or hidden data \({z_i}\). It might literally be data that is missing for some reason. Or, more interestingly, it might be due to our modeling choice: we might prefer a model with a set of meaningful but hidden variables \({z_i}\) that help explain the “causes” of \({x_i}\). Good examples of this category would be Gaussian (or other kinds of) mixture models and latent Dirichlet allocation (LDA).</p>
<blockquote>
<p>Note to myself: examples where we introduce latent variables just for the sake of making the optimization problem easier?</p>
</blockquote>
<p>In either case, we will need to have a model for calculating the joint distribution of \(x\) and \(z\), \(p(x,z;\theta)\), which may arise from assumptions (in the case of missing data) or from models of marginal density functions \(p(z; \theta)\) and \(p(x\vert z; \theta)\). In such cases, the log likelihood can be expressed as</p>
\[\begin{equation}
\begin{split}
\ell(\theta) &= \sum_i \log\left( p_{\theta}(x_i) \right)\\\\ &= \sum_i \log\left( \sum_{z} p_{\theta}(x_i, Z=z) \right)\\\\
&= \sum_i \log\left( \sum_{z} p_{\theta}(x_i\vert Z=z)p_{\theta}(Z=z) \right)
\end{split}
\end{equation}\]
<p>Direct maximization of \(\ell(\theta)\) with respect to \(\theta\) might be challenging, due to the summation over \(z\) inside the log. But the problem would be much easier if we knew the values of \(z\): it would simply be the original maximum likelihood problem with all data available.</p>
\[\begin{equation}
\begin{split}
\ell(\theta) &= \sum_i \log\left(p_{\theta}(x_i\vert Z=z_i)p_{\theta}(Z=z_i) \right) \\\\
&= \sum_i \log\left(p_{\theta}(x_i, z_i) \right)
\end{split}
\end{equation}\]
<p>The collection of \(({x_i}, {z_i})\) is called the <em>complete</em> data. Naturally, \({x_i}\) is the <em>incomplete</em> data and \({z_i}\) is the <em>latent</em> data/variable.</p>
<p>Roughly speaking, the EM algorithm is an iterative method that lets us guess \(z_i\) based on \(x_i\) (and the current estimate of the model parameter \(\hat\theta\)). With the guessed “fill-in” \(z_i\) we have <em>complete</em> data and can optimize the log likelihood \(\ell(\theta)\) with respect to \(\theta\). We thus iteratively improve our guesses of the latent variable \(z\) and the parameter \(\theta\), repeating until convergence.</p>
<p>In slightly more detail, instead of guessing a single value of \(z\) we guess the distribution of \(z\) given \(x\), i.e. \(p(z\vert x;\hat\theta)\), and then optimize the expected log likelihood of the <em>complete</em> data, i.e. \(\sum_i \mathbb{E}_{z \sim p(z\vert x_i;\hat\theta)}\log p_\theta (x_i, z)\), with respect to \(\theta\); this serves as a proxy (lower bound) for the true objective \(\sum_i \log p_{\theta}(x_i)\). We repeat until convergence.</p>
<p>(Note that guessing a single value for \(z\) is in fact also a valid strategy. It corresponds to a variant of EM and is what we do in the well-known K-means algorithm, where we guess a “hard” label for each data point.)</p>
<p>The nice thing about EM is that it comes with a theoretical guarantee of monotonic improvement on the true objective, even though we directly work with a proxy (lower bound) of it. Note, however, that the rate of convergence depends on the problem, and convergence is not guaranteed to be toward a global optimum.</p>
<h2 id="formulation">Formulation</h2>
<p>As before, we start with the log likelihood</p>
\[\begin{equation}
\begin{split}
\ell(\theta) &= \sum_i \log\left( p_{\theta}(x_i) \right) \\\\
&= \sum_i \log\left( \int p_{\theta}(x_i, z) dz \right)\\\\
&= \sum_i \log\left( \int \frac{p_{\theta}(x_i, z)}{q(z)} q(z) dz \right) \\\\
&= \sum_i \log\left( \mathbb{E}_{z \sim Q} \left[ \frac {p_{\theta}(x_i, z)}{q(z)} \right] \right)\\\\
&\ge \sum_i \mathbb{E}_{z \sim Q} \left[\log\left( \frac {p_{\theta}(x_i,z)}{q(z)} \right) \right]\\\\
\label{eq:jensen}
\end{split}
\end{equation}\]
<p>Here I switched the summation over \(z\) to an integral, assuming \(z\) is continuous, just to hint that this is a possibility. The last step used Jensen’s inequality and the fact that the log function is strictly concave. So far we have put no restrictions on the distribution \(Q\), apart from \(q(z)\) being a probability density function that is positive wherever \(p_\theta(x_i,z)\) is.</p>
<p>Using the result above, let’s define the last quantity as \(\mathcal L(q,\theta)\). It is usually called ELBO (Evidence Lower BOund) as it is a lower bound of \(\ell(\theta)\).</p>
\[\begin{equation}
\mathcal L(q,\theta) = \sum_i \mathbb{E}_{z \sim Q} \left[\log\left( \frac {p_{\theta}(x_i,z)}{q(z)} \right) \right]
\end{equation}\]
<p>Just to reiterate what we have done so far: our goal is to maximize \(\ell(\theta)\); we exchanged the order of the log and the integral over \(z\) and obtained a lower bound \(\mathcal L\).</p>
<p>We can show that the difference between \(\ell(\theta)\) and \(\mathcal L(q,\theta)\) is</p>
\[\begin{equation}
\begin{split}
\ell(\theta) - \mathcal L(q,\theta) & = \sum_i \int q(z) \left(\log\left(p_\theta(x_i)\right) - \log\left(\frac{p_\theta(x_i,z)}{q(z)}\right)\right) dz\\\\
&= \sum_i \int q(z) \log\left(\frac{q(z)}{\frac{p_\theta(x_i,z)}{p_\theta(x_i)}}\right) dz \\\\
&= \sum_i \int q(z) \log\left(\frac{q(z)}{p_\theta(z\vert x_i)}\right) dz \\\\
&= \sum_i D_{KL}(q(z) \| p_\theta(z\vert x_i))
\end{split}
\end{equation}\]
<p>where we used the fact Kullback-Leibler (KL) divergence \(D_{KL}\) is defined as</p>
\[D_{KL}(P \| Q)= \int p(x) \log \left( \frac{p(x)}{q(x)} \right) dx = \mathbb{E}_{x\sim P}\left[\log\left(\frac{p(x)}{q(x)}\right)\right]\]
<p>In general, KL divergence is always nonnegative and is zero if and only if \(q(x) = p(x)\). So in our case, the equality \(\ell(\theta) = \mathcal L(q,\theta)\) holds if and only if \(q(z) = p_\theta(z\vert x_i)\). When this happens, we say the bound is tight. In this case, it makes sense to note \(q(z)\) as \(q(z\vert x_i)\) to make the dependence on \(x_i\) clear.</p>
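<p>This gap identity is easy to verify numerically for a discrete latent variable; a sketch with an arbitrary joint \(p(x,z)\) (at a fixed \(x\)) and an arbitrary \(q(z)\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p_xz = rng.random(4) / 10.0            # joint p(x, z) at a fixed x, z in {0..3}
q = rng.random(4)
q /= q.sum()                           # an arbitrary distribution q(z)

log_px = np.log(p_xz.sum())                        # log p(x)
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))      # ELBO L(q, theta)
post = p_xz / p_xz.sum()                           # posterior p(z | x)
kl = np.sum(q * (np.log(q) - np.log(post)))        # KL(q(z) || p(z|x))

print(np.isclose(log_px - elbo, kl))   # True
```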
<h2 id="em-algorithm-and-monotonicity-guarantee">EM algorithm and monotonicity guarantee</h2>
<p>The EM algorithm is remarkably simple and it goes as follows.</p>
<ul>
<li>E-step (of \(t\)-th iteration):
<ul>
<li>Let \(q^t(z) = p(z \vert x_i; \hat\theta^{t-1})\), which is calculated as shown in Eq. \(\ref{eq:E}\)</li>
<li>Due to our particular choice of \(q^t\), at the current estimate \(\hat\theta^{t-1}\) the bound is tight: \(\mathcal L(q^t,\hat\theta^{t-1}) = \ell(\hat\theta^{t-1})\)</li>
</ul>
</li>
<li>M-step
<ul>
<li>Maximize \(\mathcal L(q^t,\theta)\) with respect to \(\theta\), see Eq. \(\ref{eq:M}\)</li>
<li>This step improves ELBO by finding a better \(\theta\): \(\mathcal L(q^t,\theta^t) \ge \mathcal L(q^t,\theta^{t-1})\)</li>
</ul>
</li>
</ul>
<p>The calculation in <strong>E-step</strong> is</p>
\[\begin{equation}\label{eq:E}
p(z\vert x_i; \hat\theta^{t-1}) = \frac{p(x_i\vert z; \hat\theta^{t-1})p(z; \hat\theta^{t-1})}{\int p(x_i\vert z; \hat\theta^{t-1})p(z; \hat\theta^{t-1}) dz}
\end{equation}\]
<p>Just to spell out the function \(\mathcal L(q^t,\theta)\) that we maximize in <strong>M-step</strong>.</p>
\[\begin{equation}
\begin{split}
\hat\theta^t &= \argmax_{\theta} \mathcal L(q^t,\theta) \\\\
&= \argmax_{\theta} \sum_i \mathbb{E}_{z \sim Q^t} \left[\log\left(p(x_i,z;\theta) \right) \right] \\\\
&= \argmax_{\theta} \sum_i \int p(z\vert x_i; \hat\theta^{t-1}) \log\left(p(x_i,z;\theta)\right) dz \\\\
\end{split}
\label{eq:M}
\end{equation}\]
<p>With the preparation earlier we can also easily show the theoretical guarantee on monotonic improvement over the optimization objective \(\ell(\theta)\).</p>
\[\begin{equation}\label{eq:monotone}
\ell(\theta^{t-1}) \underset{\text{E-step}}{=} \mathcal L(q^t,\theta^{t-1}) \underset{\text{M-step}}{\le} \mathcal L(q^t,\theta^t) \underset{\text{Jensen}}{\le} \ell(\theta^{t})
\end{equation}\]
<h3 id="why-the-e-in-e-step">Why the “E” in E-step</h3>
<p>By the way, the reason it is called the E-step is that in this step we do the necessary calculation to figure out the form of \(\mathcal L(q,\theta)\) as a function of \(\theta\), which we then optimize in the M-step. That form is the <strong>expectation</strong> of the <em>complete</em>-data log likelihood over the estimated distribution of the latent variable \(z\).</p>
<h3 id="em-as-maximization-maximization">EM as maximization-maximization</h3>
<p>Because the particular choice of \(q^t(z)\) in the E-step makes \(D_{KL}(q(z) \| p_\theta(z\vert x_i))\) vanish, the E-step can be viewed as maximizing \(\mathcal L(q,\hat\theta^{t-1})\) with respect to \(q\), and the M-step as maximizing it with respect to \(\theta\). So we are doing alternating maximization of the ELBO with respect to \(q\) and \(\theta\).</p>
\[\begin{equation}
\begin{split}
& \text{E-step:}\hspace{4pt}q^t(z) = \argmax_q \mathcal L(q,\hat\theta^{t-1})\\\\
& \text{M-step:}\hspace{4pt}\hat\theta^t = \argmax_\theta \mathcal L(q^t,\theta)
\end{split}
\end{equation}\]
<p>This maximization-maximization view offers justification for partial E-step (when the required calculation in exact E-step is intractable) and partial M-step (i.e. only find a \(\theta\) that increases the ELBO rather than maximizes it). Under this view, the direct maximization on ELBO as a goal offers a strong connection to <strong>variational inference</strong> as will be discussed briefly below.</p>
<h3 id="example-gaussian-mixture">Example: Gaussian Mixture</h3>
<p>In the context of the Gaussian Mixture Model (GMM), the \(z_i\) associated with \(x_i\) takes values in \(\{1,2,\dots,n_{g}\}\), where \(n_g\) is the number of Gaussians in the model; thus \(z_i\) indicates which Gaussian cluster the observed data point \(x_i\) belongs to. The parameter set \(\theta\) includes the parameters of the marginal distribution of \(z\), \(P(Z;\vec \pi)\) with \(\vec \pi = [\pi_1, \pi_2, \dots, \pi_{n_g}]\), \(\sum_{j=1}^{n_g} \pi_j = 1\) and \(\pi_j > 0\), as well as the parameters of the conditional distributions \(P(X \vert Z=j; \mu_j, \sigma_j) = \mathcal N(\mu_j, \sigma_j)\).</p>
<p>For a detailed walk-through see Andrew Ng’s CS229 lecture <a href="http://cs229.stanford.edu/notes/cs229-notes8.pdf">notes</a> and <a href="https://www.youtube.com/watch?v=ZZGTuAkF-Hw">video</a></p>
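<p>A minimal numpy sketch of the E- and M-steps for a two-component 1-D mixture (illustrative data; the component variances are fixed at 1 for brevity, so only \(\vec\pi\) and the means are updated):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

pi = np.array([0.5, 0.5])              # initial guesses for theta
mu = np.array([-1.0, 1.0])
for _ in range(50):
    # E-step: responsibilities r_ik proportional to pi_k * N(x_i; mu_k, 1)
    log_post = np.log(pi) - 0.5 * (x[:, None] - mu) ** 2
    r = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log likelihood
    pi = r.mean(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(np.round(mu, 1), np.round(pi, 2))   # means near [-2, 3], weights near 0.5
```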
<h2 id="variants-and-extensions-of-em">Variants and extensions of EM</h2>
<h3 id="gem-and-cem">GEM and CEM</h3>
<p>A popular variant to EM is that in Eq. \(\ref{eq:M}\) we merely find a \(\hat\theta^t\) that increases (rather than maximizes) \(\mathcal L(q^t,\theta)\). It is easy to see \(\ref{eq:monotone}\) and the monotonicity guarantee still holds in this situation. This algorithm is proposed in the original EM paper and called <em>Generalized EM (GEM)</em>.</p>
<p>Another variant is the point-estimate version we mentioned earlier, where instead of setting \(q^t(z) = p(z\vert x_i; \hat\theta^{t-1})\) in the E-step, we take \(z\) to be a single value, the most probable one, i.e. \(\hat{z}^t=\argmax_z p(z\vert x_i; \hat\theta^{t-1})\), or equivalently take \(q^t(z) = \delta(z-\hat{z}^t)\). In this case the integral in \(\ref{eq:M}\) is greatly simplified, but the first equality in \(\ref{eq:monotone}\) no longer holds and we lose the theoretical guarantee. This algorithm is also called <em>Classification EM (CEM)</em>.</p>
<h3 id="stochastic-em">Stochastic EM</h3>
<p>As we can see in Eq. \(\ref{eq:M}\), we need to go through all data points in order to update \(\theta\), which can be a long process for large data sets. In much the same spirit as stochastic gradient descent, we can sample subsets of the data and run the E- and M-steps on these mini-batches. The same idea can be used for the variational inference methods mentioned below, for the updates of <em>global</em> latent variables (such as \(\theta\)).</p>
<h3 id="variational-inference">Variational inference</h3>
<p>The computation of the optimal \(q(z)\), i.e. \(q(z) = p(z \vert x_i; \hat\theta^{t-1})\) in the E-step, might be intractable. In particular, the integral in the denominator of Eq. \(\ref{eq:E}\) has no closed-form solution for many interesting models. In this case we can take the view of EM as maximization-maximization and try to come up with better and better \(q(z)\) to improve the ELBO. To proceed with such variational optimization tasks, we need to specify the functional family \(\mathcal Q\) from which we choose \(q(z)\). Depending on the assumptions, a number of interesting algorithms have been developed; the most popular one is probably <strong>mean-field approximation</strong>.</p>
<p>Note that in a typical variational inference framework, the parameter \(\theta\) is treated as a first-class variable that we do inference on (i.e. obtaining \(p(\theta\vert x)\)) rather than taking a single maximum likelihood point estimate, so \(\theta\) becomes part of the latent variables and is absorbed into the notation \(z\). Thus \(z\) includes <em>global</em> variables such as \(\theta\) and <em>local</em> variables such as the latent labels \(z_i\) associated with each data point \(x_i\).</p>
<p>In the mean-field method, the constraint we put on \(q(z)\) is that it factorizes, i.e. \(q(z) = \prod_k q_k(z_k)\). This says that all latent variables are mutually independent, by assumption. This seemingly simple assumption brings remarkable simplifications to the integrals involved, especially the expectations of the log likelihood. It leads to a coordinate ascent variational inference (CAVI) algorithm that allows closed-form iterative calculation for certain model families. The coordinate updates on <em>local</em> variables correspond to the E-step in EM, while the updates on <em>global</em> variables correspond to the M-step.</p>
<p>For more about this topic see: D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, <a href="https://arxiv.org/abs/1601.00670">“Variational Inference: A Review for Statisticians,”</a> J. Am. Stat. Assoc., vol. 112, no. 518, pp. 859–877, 2017.</p>
<hr />
<h2 id="references">References</h2>
<blockquote>
<p>Todo: add citation in text; for now just core dumped some references here</p>
</blockquote>
<p>In no particular order:</p>
<ol>
<li>
<p>A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Stat. Soc. Ser. B Methodol., vol. 39, no. 1, pp. 1–38, 1977.</p>
</li>
<li>
<p>R. M. Neal and G. E. Hinton, “A View of the Em Algorithm that Justifies Incremental, Sparse, and other Variants,” Learn. Graph. Model., pp. 355–368, 1998.</p>
</li>
<li>
<p>J. A. Bilmes, “A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models,” ReCALL, vol. 1198, no. 510, p. 126, 1998.</p>
</li>
<li>
<p>A. Roche, “EM algorithm and variants: an informal tutorial,” pp. 1–17, 2011.</p>
</li>
<li>
<p>M. R. Gupta, “Theory and Use of the EM Algorithm,” Found. Trends® Signal Process., vol. 4, no. 3, pp. 223–296, 2010.</p>
</li>
<li>
<p>M. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “Introduction to variational methods for graphical models,” Mach. Learn., vol. 37, no. 2, pp. 183–233, 1999.</p>
</li>
<li>
<p>M. J. Wainwright and M. Jordan, “Graphical Models, Exponential Families, and Variational Inference,” Found. Trends® Mach. Learn., vol. 1, no. 1–2, pp. 1–305, 2007.</p>
</li>
<li>
<p>M. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic Variational Inference,” vol. 14, pp. 1303–1347, 2012.</p>
</li>
<li>
<p>D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational Inference: A Review for Statisticians,” J. Am. Stat. Assoc., vol. 112, no. 518, pp. 859–877, 2017.</p>
</li>
<li>
<p>S. Mohamed, “Variational Inference for Machine Learning,” no. February, 2015.</p>
</li>
<li>
<p>Z. Ghahramani, “Variational Methods The Expectation Maximization ( EM ) algorithm,” no. April, 2003.</p>
</li>
</ol>
<p>By Ran Ding. A quick walk-through of the Expectation-Maximization (EM) algorithm.</p>