Jekyll2019-11-16T02:10:32+00:00https://dingran.github.io/feed.xmlRan DingRan Ding's homepageRan DingExponential-Min and Gumbel-Max2019-01-01T00:00:00+00:002019-01-01T00:00:00+00:00https://dingran.github.io/Gumbel<script type="math/tex; mode=display">\newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}}</script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> <h2 id="introduction">Introduction</h2> <p>I originally wanted to write down the proof for the Gumbel-max trick but soon realized it is actually the same idea as a much more common problem: the <em>exponential race</em>. So, in this note let’s start from this common problem and arrive at the Gumbel-max trick.</p> <h2 id="competing-alarms">Competing Alarms</h2> <p>As preparation, let’s solve a probability problem first.</p> <hr /> <p>There are <script type="math/tex">N</script> clocks started simultaneously, such that the <script type="math/tex">i</script>-th clock rings after a random time <script type="math/tex">T_i \sim \text{Exp}(\lambda_i)</script>.</p> <ul> <li> <p>(1) Designate <script type="math/tex">X</script> as the random time after which some clock (i.e. any one of the clocks) rings; find the distribution of <script type="math/tex">X</script></p> </li> <li> <p>(2) Find the probability that the <script type="math/tex">i</script>-th clock rings first</p> </li> </ul> <hr /> <p>Let <script type="math/tex">X = \min \{T_1, T_2, \dots, T_N \}</script> and let <script type="math/tex">F_X(t)</script> and <script type="math/tex">F_{T_i}(t)</script> be the CDFs.
We also have <script type="math/tex">F_{T_i}(t) = 1- e^{-\lambda_it}</script>.</p> <p>Following the order statistics of the <script type="math/tex">\min</script>, we have <script type="math/tex">P(X>t) = \prod_{i=1}^N P(T_i>t)</script> or equivalently,</p> <script type="math/tex; mode=display">1 - F_X(t) = \prod_{i=1}^N (1-F_{T_i}(t)) = \prod_{i=1}^N e^{-\lambda_it} = e^{-\sum_{i=1}^N \lambda_it}</script> <p>Therefore</p> <script type="math/tex; mode=display">\begin{equation} X \sim \text{Exp}(\lambda_X = \sum_{i=1}^N \lambda_i) \label{part1} \end{equation}</script> <p>i.e. the <script type="math/tex">\min</script> of a set of independent exponential random variables is still an exponential random variable, with the rate <script type="math/tex">\lambda_X</script> being the sum of the rates of that set of random variables.</p> <p>For the second part of the problem, we can consider two competing alarms <script type="math/tex">T_1</script> and <script type="math/tex">T_2</script> to begin with. Our goal is to find <script type="math/tex">% <![CDATA[ P(T_1<T_2) %]]></script>.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{split} P(T_1 < T_2) & = \int_0^{+\infty} \int_{t_1}^{+\infty} P(T_1=t_1) P(T_2=t_2) dt_2 dt_1 \\\\ &= \int_0^{+\infty} P(T_1=t_1) \left(1-F_{T_2}(t_1)\right) dt_1 \\\\ &= \int_0^{+\infty} \lambda_1 e^{-\lambda_1 t_1} e^{-\lambda_2 t_1} dt_1 \\\\ & = \frac{\lambda_1}{\lambda_1+\lambda_2} \end{split} %]]></script> <p>Now, let’s consider one specific clock <script type="math/tex">T_k</script> versus all the rest, denoted <script type="math/tex">T_{-k} = \min \{T_i\}_{i \neq k}</script>. According to <script type="math/tex">\ref{part1}</script>, we know that <script type="math/tex">T_{-k} \sim \text{Exp}(\sum_{i\neq k} \lambda_i)</script>.
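Incidentally, both results so far are easy to sanity-check by simulation. Here is a minimal numpy sketch; the rates and sample count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.5, 1.0, 2.5])  # arbitrary rates lambda_i
n = 200_000

# Each column i holds n draws of T_i ~ Exp(lambda_i); note numpy's
# exponential() takes the scale 1/lambda, not the rate lambda.
T = rng.exponential(1.0 / lam, size=(n, len(lam)))

# Part (1): X = min_i T_i should be Exp(sum(lambda_i)), so E[X] = 1/sum(lambda_i)
X = T.min(axis=1)
print(X.mean(), 1.0 / lam.sum())  # the two numbers should nearly agree

# Two competing alarms: P(T_1 < T_2) should be lambda_1 / (lambda_1 + lambda_2)
print((T[:, 0] < T[:, 1]).mean(), lam[0] / (lam[0] + lam[1]))
```

The empirical mean of X and the empirical frequency of the first clock beating the second should both match the closed-form values to about two decimal places.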
Using the result above, we have the solution for part (2) as follows</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} P(T_k \text{ rings first}) = P(T_k<T_{-k}) = \frac{\lambda_k}{\lambda_k+\sum_{i\neq k}\lambda_i} = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i} \label{part2} \end{equation} %]]></script> <p>Of course, we can do the integration directly and get the same result</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{split} P(T_k<T_{-k}) & = \int_0^{+\infty} P(T_k=t_k) \left( \idotsint_{t_k}^{+\infty} \prod_{i\neq k}P(T_i=t_i) dt_i \right) dt_k \\\\ & = \int_0^{+\infty} P(T_k=t_k) \left( \prod_{i\neq k} \left(1-F_{T_i}(t_k)\right) \right) dt_k \\\\ & = \int_0^{+\infty} \lambda_k \exp{\left(-\lambda_k t_k\right)} \exp{\left(-\sum_{i \neq k}\lambda_i t_k\right)} dt_k \\\\ & = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i} \end{split} %]]></script> <p>By the way, this setup, in which multiple exponential random variables compete and we look for the first arrival, is also called an <em>exponential race</em>.</p> <h2 id="exponential-min-trick">Exponential-Min Trick<a name="argmin"></a></h2> <p>I just made up the name “Exponential-Min”. The better name for this section is probably <em>Sampling from Multinomial by Argmining</em>.</p> <p>Suppose we have a set of positive numbers <script type="math/tex">[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]</script>. Correspondingly we have a normalized probability vector <script type="math/tex">\vec{p}=[p_1, p_2, p_3, \dots, p_N]</script>, where <script type="math/tex">p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}</script>.
This probability vector specifies a multinomial distribution over <script type="math/tex">N</script> choices.</p> <p>Now, if we were to draw a sample from <script type="math/tex">\{1, 2, \dots, N\}</script> according to this multinomial distribution specified by <script type="math/tex">\vec{p}</script> (which is fundamentally specified by <script type="math/tex">[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]</script>), what should we do?</p> <p>Normally, we do the following:</p> <ol> <li>We have <script type="math/tex">[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]</script></li> <li>We compute <script type="math/tex">\vec{p}=[p_1, p_2, p_3, \dots, p_N]</script>, where <script type="math/tex">p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}</script>.</li> <li>We generate a uniform random number <script type="math/tex">Q</script> between 0 and 1, i.e. <script type="math/tex">Q \sim \text{Uniform}(0,1)</script></li> <li>We figure out where <script type="math/tex">Q</script> lands among the cumulative sums of <script type="math/tex">\vec{p}</script>, i.e. we return <script type="math/tex">i</script> if <script type="math/tex">% <![CDATA[ \sum_{j=1}^{i-1} p_j < Q \leq \sum_{j=1}^{i} p_j %]]></script>. (Of course, the empty sum for <script type="math/tex">i=1</script> is taken to be 0.)</li> </ol> <p>But that’s the boring way. Now that we have this new Exponential-Min trick, we can do the following:</p> <ol> <li>We have <script type="math/tex">[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]</script></li> <li>We don’t compute <script type="math/tex">\vec{p}</script>; instead we sample <script type="math/tex">T_i \sim \text{Exp}(\lambda_i)</script> for <script type="math/tex">i=1, 2, \dots, N</script>, i.e.
we have a total of <script type="math/tex">N</script> samples, one from each <script type="math/tex">\text{Exp}(\lambda_i)</script></li> <li>We now take <script type="math/tex">\argmin([T_1, T_2, \dots, T_N])</script> as our result sample</li> <li>We proved in <script type="math/tex">\ref{part2}</script> that such a result sample indeed follows the multinomial distribution specified by <script type="math/tex">\vec{p}=[p_1, p_2, p_3, \dots, p_N]</script>, where <script type="math/tex">p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}</script>.</li> </ol> <p>Thus, somehow we ended up <em>Sampling from Multinomial by Argmining</em>!</p> <h2 id="gumbel-distribution">Gumbel Distribution</h2> <p>Now let’s move from the exponential distribution to the Gumbel distribution.</p> <p>The Gumbel distribution with unit scale (<script type="math/tex">\beta=1</script>) is parameterized by a location parameter <script type="math/tex">\mu</script>. <script type="math/tex">\text{Gumbel}(\mu)</script> has CDF and PDF as follows</p> <script type="math/tex; mode=display">\text{CDF: } F(x; \mu)=e^{-e^{-(x-\mu)}}</script> <script type="math/tex; mode=display">\text{PDF: }f(x; \mu) = e^{-\left((x-\mu)+e^{-(x-\mu)}\right)}</script> <p>Suppose we have a set of <script type="math/tex">N</script> independent Gumbel random variables <script type="math/tex">G_i</script>, each with its own location parameter <script type="math/tex">\mu_i</script>, i.e.
<script type="math/tex">G_i \sim \text{Gumbel}(\mu_i)</script>.</p> <p>The Gumbel distribution has two properties that are quite analogous to the <em>exponential race</em> example above.</p> <ul> <li>(1) Let <script type="math/tex">Z = \max \{G_i \}</script>, then <script type="math/tex">Z \sim \text{Gumbel}\left(\mu_Z = \log \sum_{i=1}^N e^{\mu_i} \right)</script></li> </ul> <p>The proof is straightforward and similar to the one above:</p> <script type="math/tex; mode=display">F_Z(x) = \prod_{i=1}^N F_{G_i}(x) = \prod_{i=1}^N e^{-e^{-(x-\mu_i)}} = e^{-\sum_{i=1}^N e^{-(x-\mu_i)}} = e^{-e^{-x} \sum_{i=1}^N e^{\mu_i}} = e^{-e^{-(x-\mu_Z)}}</script> <ul> <li>(2) A corollary of the above is that the probability of <script type="math/tex">G_k</script> being the max, with <script type="math/tex">G_{-k} = \max \{G_i\}_{i \neq k}</script>, is <script type="math/tex">P(G_k > G_{-k}) = \frac{e^{\mu_k}}{\sum_{i=1}^N e^{\mu_i}}</script></li> </ul> <h2 id="gumbel-max-trick">Gumbel-Max Trick</h2> <p>Now we can tell a nearly identical/parallel story to the one in the section <a href="#argmin">Exponential-Min Trick</a>.
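As a quick empirical check that the exponential and Gumbel stories really do coincide, here is a minimal numpy sketch. The weights are arbitrary; setting each Gumbel location to the log of the corresponding rate makes both tricks target the same probability vector:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 2.0, 3.0])   # positive weights lambda_i
mu = np.log(lam)                  # Gumbel locations, chosen so e^{mu_i} = lambda_i
p = lam / lam.sum()               # target probabilities [1/6, 1/3, 1/2]
n = 200_000

# Exponential-Min: argmin over T_i ~ Exp(lambda_i)
# (numpy's exponential() takes the scale 1/lambda, not the rate)
T = rng.exponential(1.0 / lam, size=(n, len(lam)))
p_exp = np.bincount(T.argmin(axis=1), minlength=len(lam)) / n

# Gumbel-Max: argmax over G_i = mu_i + standard Gumbel noise
G = mu + rng.gumbel(size=(n, len(lam)))
p_gum = np.bincount(G.argmax(axis=1), minlength=len(lam)) / n

print(p)      # target probability vector
print(p_exp)  # empirical frequencies from argmin over exponentials
print(p_gum)  # empirical frequencies from argmax over Gumbels
```

All three printed vectors should agree to about two decimal places, matching the multinomial probabilities derived in the exponential race and its Gumbel analogue.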
And, this section should really be called <em>Sampling from Multinomial by Argmaxing</em>.</p> <p>The main differences are</p> <ul> <li>The numbers (parameters) <script type="math/tex">\mu_i</script> can be negative, whereas the <script type="math/tex">\lambda_i</script> must be positive</li> <li>The probability vector is determined by <script type="math/tex">p_k = \frac{e^{\mu_k}}{\sum_{i=1}^N e^{\mu_i}}</script> instead of <script type="math/tex">p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}</script></li> <li>We generate samples with <script type="math/tex">G_i \sim \text{Gumbel}(\mu_i)</script> instead of <script type="math/tex">T_i \sim \text{Exp}(\lambda_i)</script></li> <li>We take <script type="math/tex">\argmax</script> over <script type="math/tex">G_i</script> instead of taking <script type="math/tex">\argmin</script> over <script type="math/tex">T_i</script></li> </ul> <h2 id="when-is-gumbel-max-trick-useful">When is Gumbel-Max Trick Useful?</h2> <p>It seems like a lot of work to sample from a multinomial by argmaxing over Gumbel samples (or argmining over exponential samples). In what situation would we ever want to do this?</p> <p>The short answer is that the Gumbel-Max trick allows us to make a sampling step <strong>differentiable</strong>. Specifically, it makes sampling from a multinomial distribution differentiable. We’ll take a closer look at this in a future post, but pause for a second and think about it. We are saying it is possible to differentiate through the action of drawing a discrete sample from a multinomial distribution! This was a pretty surprising/amazing possibility to me.</p> <p>Regarding downstream applications, differentiating through sampling is an important “trick” in neural-network-based variational inference in general. Multinomial discrete random variables are prevalent in many learning problems.
The Gumbel-max trick allows us to work with them in many interesting neural variational inference problems, which we will look into in future posts.</p>Ran DingExponential-min and Gumbel-max tricks for sampling from a multinomial distribution by taking the argmin and argmax.Recent Progress in Language Modeling2018-10-09T00:00:00+00:002018-10-09T00:00:00+00:00https://dingran.github.io/LM<script type="math/tex; mode=display">\newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}}</script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> <h2 id="overview">Overview</h2> <p>This page is a high-level summary / notes of various recent results in language modeling, with little explanation. Papers to cover are as follows:</p> <p><strong>AWD Language Model</strong></p> <ul> <li>Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. “Regularizing and optimizing LSTM language models.” arXiv preprint arXiv:1708.02182 (2017).</li> </ul> <p><strong>Neural Cache</strong></p> <ul> <li>Grave, Edouard, Armand Joulin, and Nicolas Usunier. “Improving neural language models with a continuous cache.” arXiv preprint arXiv:1612.04426 (2016).</li> </ul> <p><strong>Dynamic Evaluation</strong></p> <ul> <li>Krause, Ben, et al. “Dynamic evaluation of neural sequence models.” arXiv preprint arXiv:1709.07432 (2017).</li> </ul> <p><strong>Memory-based Parameter Adaptation (MbPA)</strong></p> <ul> <li>Sprechmann, Pablo, et al. “Memory-based parameter adaptation.” arXiv preprint arXiv:1802.10542 (2018).</li> </ul> <p><strong>Hebbian Softmax</strong></p> <ul> <li>Rae, Jack W., et al. “Fast Parametric Learning with Activation Memorization.” arXiv preprint arXiv:1803.10049 (2018).</li> </ul> <p><strong>Higher-rank LM / Mixture-of-Softmax (MoS)</strong></p> <ul> <li>Yang, Zhilin, et al.
“Breaking the softmax bottleneck: A high-rank RNN language model.” arXiv preprint arXiv:1711.03953 (2017).</li> </ul> <p>This is by no means an exhaustive literature review - these are only a selection of a few of the most recent state-of-the-art results. The AWD LM has almost become the de facto baseline LM for many of the other papers; its main innovations are a special version of <strong>A</strong>veraged SGD (ASGD) along with DropConnect-based <strong>W</strong>eight <strong>D</strong>ropping regularization in the hidden-to-hidden mapping of an LSTM model.</p> <p>It has been found that a global LM is ineffective in reacting to local patterns at test time; for example, once a rare word appears, a further reappearance in its proximity is much more likely than predicted by a global LM. To allow for faster reaction to local patterns, [2-5] propose various schemes involving a fast-learning non-parametric component, blending its predictions or parameters with the globally learned parametric LM. A quick comparison of these 4 papers is in the table below.</p> <table> <thead> <tr> <th>Ref</th> <th>Method</th> <th>Modifications to training?</th> <th>Adaptation needed at test time?</th> </tr> </thead> <tbody> <tr> <td>[2]</td> <td>Keep a key-value store with keys being previous (fixed-size) output hidden states and values being the correct labels. This non-parametric cache provides a local LM based on nearest-neighbor lookup. This is then interpolated with the global LM for the final prediction.</td> <td>No</td> <td>No</td> </tr> <tr> <td>[3]</td> <td>Similar to [2], but instead of doing nearest-neighbor lookup over saved hidden states, here we fit recent history with gradient descent, thus providing a slightly adjusted model, i.e. parameters, not just predictions, are adapted to recent history.
One concern I would have is whether the continuous adaptation would let the model run away too far from the initially trained model.</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>[4]</td> <td>Similar to [3], but the test-time gradient descent produces a local model that is discarded after use for prediction, i.e. unlike [3], the change of parameters due to local memory does not carry over to the next time step. Thus this is quite closely related to meta-learning. Another minor point: the gradient descent does not go through the full network, but stops at the so-called embedding layer, which is usually a layer close to the output, extracting fairly abstract features.</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>[5]</td> <td>Recent output hidden states are accumulated into one vector using an exponential moving average and then directly written into the output linear-mapping parameter matrix. Two sets of update rules are used at training. Non-parametric learning is tapered off as words are seen more frequently. Different from [2-4], this method incorporates fast learning at training time, not just fast adaptation at test time.</td> <td>Yes</td> <td>No</td> </tr> </tbody> </table> <p>Table 1.
Comparison of methods in Ref [2-5]</p> <p>And finally, [6] highlights and mostly solves a fairly general problem: the softmax over products of rank-limited matrices, which is common in the decoder of an LM.</p> <hr /> <h2 id="awd-lm">AWD LM</h2> <h2 id="neural-cache">Neural Cache</h2> <h2 id="dynamic-evaluation">Dynamic Evaluation</h2> <h2 id="memory-based-parameter-adaptation-mbpa">Memory-based Parameter Adaptation (MbPA)</h2> <h2 id="hebbian-softmax">Hebbian Softmax</h2> <h2 id="higher-rank-lm--mixture-of-softmax-mos">Higher-rank LM / Mixture-of-Softmax (MoS)</h2> <iframe src="https://drive.google.com/file/d/1nMS1FnJ8xQPcZ06JokXDU4f5Z8O_riig/preview?usp=sharing" width="100%" height="600"></iframe>Ran DingA brief overview of various techniques in recent language model (LM) literature including AWD LM, the use of cache, dynamic evaluation, other memory-based non-parametric components to enhance learned parametric LM, and finally, recent progress in high-rank LM.Preparations for DS/AI/ML/Quant2018-05-05T00:00:00+00:002018-05-05T00:00:00+00:00https://dingran.github.io/PP<h2 id="what-is-this">What is this</h2> <p>A short list of resources and topics covering the essential quantitative tools for data scientists, AI/machine learning practitioners, quant developers/researchers and those who are preparing to interview for these roles.</p> <p>At a high level we can divide things into 3 main areas:</p> <ol> <li>Machine Learning</li> <li>Coding</li> <li>Math (calculus, linear algebra, probability, etc)</li> </ol> <p>Depending on the type of role, the emphasis can be quite different. For example, AI/ML interviews might go deeper into the latest deep learning models, while quant interviews might cast a wide net on various kinds of math puzzles.
Interviews for research-oriented roles might be lighter on coding problems or at least emphasize algorithms instead of software design or tooling.</p> <h2 id="list-of-resources">List of resources</h2> <p>A minimalist list of the best/most practical ones:</p> <p><img src="https://dingran.github.io/assets/images/PP/cs229.png" alt="" /> <img src="https://dingran.github.io/assets/images/PP/mit6006.jpg" alt="" /> <img src="https://dingran.github.io/assets/images/PP/stats110.jpg" alt="" /></p> <p>Machine Learning:</p> <ul> <li>Course on classic ML: Andrew Ng’s CS229 (there are several different versions; <a href="https://www.coursera.org/learn/machine-learning">the Coursera one</a> is easily accessible. There is also an <a href="https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599">older version</a> recorded at Stanford)</li> <li>Book on classic ML: Alpaydin’s Intro to ML <a href="https://www.amazon.com/Introduction-Machine-Learning-Adaptive-Computation/dp/026201243X/ref=la_B001KD8D4G_1_2?s=books&amp;ie=UTF8&amp;qid=1525554938&amp;sr=1-2">link</a></li> <li>Course with a deep learning focus: <a href="http://cs231n.stanford.edu/">CS231n</a> from Stanford, lectures available on YouTube.</li> </ul> <blockquote> <p>If you are just breaking into the field, I think the above are enough; stop there and move on to other areas of preparation.
Here are a few very optional items, mostly on deep learning, in case you have more time:</p> </blockquote> <blockquote> <ul> <li>Overview book on deep learning: <a href="https://www.deeplearningbook.org/">Deep Learning</a> by Ian Goodfellow et al.</li> <li>Amazing book on deep learning NLP: Yoav Goldberg’s <a href="https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies-ebook/dp/B071FGKZMH">Neural Network Methods for Natural Language Processing</a></li> <li>Pick one of those Udacity nanodegrees on deep learning / self-driving cars</li> <li>Hands-on exercises on deep learning: PyTorch and MXNet/Gluon are easier to pick up compared to TensorFlow. For any one of them, you can find plenty of hands-on examples online. My biased recommendation is <a href="https://d2l.ai/">https://d2l.ai/</a> using MXNet/Gluon, created by people at Amazon (it came from <a href="https://github.com/zackchase/mxnet-the-straight-dope">mxnet-the-straight-dope</a>)</li> </ul> </blockquote> <p>Coding:</p> <ul> <li>Course: MIT OCW 6006 <a href="https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/">link</a></li> <li>Book: Cracking the Coding Interview <a href="https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/098478280X">link</a></li> <li>Practice sites: <a href="https://leetcode.com/">LeetCode</a>, <a href="https://www.hackerrank.com/">HackerRank</a></li> <li>SQL tutorial: from <a href="https://community.modeanalytics.com/sql/">Mode Analytics</a></li> </ul> <p>Math:</p> <ul> <li>Calculus and Linear Algebra: an undergrad class would be best; refresher notes from CS229 <a href="http://cs229.stanford.edu/section/cs229-linalg.pdf">link</a></li> <li>Probability: Harvard Stats110 <a href="https://projects.iq.harvard.edu/stat110/home">link</a>; <a
href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1466575573/ref=pd_lpo_sbs_14_t_2?_encoding=UTF8&amp;psc=1&amp;refRID=5W11QQ7WW4DFE0Q89N7V">book</a> from the same professor.</li> <li>Statistics: Schaum’s Outline <a href="https://www.amazon.com/Schaums-Outline-Statistics-5th-Outlines/dp/0071822526">link</a>.</li> <li>[Optional] Numerical Methods and Optimization: these are really two different topics; college courses are probably the best bet. I have yet to find good online courses for them. But don’t worry, most interviews won’t really touch on them.</li> </ul> <h2 id="list-of-topics">List of topics</h2> <p>Here is a list of topics from which interview questions are often derived. The depth and trickiness of the questions certainly depend on the role and the company.</p> <p>Under each topic I try to add a few bullet points of the key things you should know.</p> <h3 id="machine-learning">Machine learning</h3> <ul> <li>Models (roughly in decreasing order of frequency) <ul> <li>Linear regression - e.g. assumptions, multicollinearity, derive from scratch in linear algebra form</li> <li>Logistic regression <ul> <li>be able to write out everything from scratch: from defining a classification problem to the gradient updates</li> </ul> </li> <li>Decision trees/forests - e.g. how does a tree/forest grow, on a pseudocode level</li> <li>Clustering algorithms <ul> <li>e.g. K-means, agglomerative clustering</li> </ul> </li> <li>SVM <ul> <li>e.g. margin-based loss objectives, how do we use support vectors, the primal-dual problem</li> </ul> </li> <li>Generative vs discriminative models <ul> <li>e.g.
Gaussian mixture, Naive Bayes</li> </ul> </li> <li>Anomaly/outlier detection algorithms (DBSCAN, LOF etc)</li> <li>Matrix factorization based models</li> </ul> </li> <li>Training methods <ul> <li>Gradient descent, SGD and other popular variants - Understand momentum, how they work, and what the differences are between the popular ones (RMSProp, Adagrad, Adadelta, Adam etc) - Bonus point: when not to use momentum?</li> <li>EM algorithm - Andrew’s <a href="http://cs229.stanford.edu/notes/cs229-notes8.pdf">lecture notes</a> are great, also see <a href="https://dingran.github.io/EM/">this</a></li> <li>Gradient boosting</li> </ul> </li> <li>Learning theory / best practice (see Andrew’s advice <a href="http://cs229.stanford.edu/materials/ML-advice.pdf">slides</a>) <ul> <li>Bias vs variance, regularization</li> <li>Feature selection</li> <li>Model validation</li> <li>Model metrics</li> <li>Ensemble methods, boosting, bagging, bootstrapping</li> </ul> </li> <li>Generic topics on deep learning <ul> <li>Feedforward networks</li> <li>Backpropagation and the computation graph <ul> <li>I really liked the <a href="https://gist.github.com/dingran/154a524003c86ecab4a949c538afa766">miniflow</a> project Udacity developed</li> <li>In addition, be absolutely familiar with taking derivatives with matrices and vectors, see <a href="http://cs231n.stanford.edu/vecDerivs.pdf">Vector, Matrix, and Tensor Derivatives</a> by Erik Learned-Miller and <a href="http://cs231n.stanford.edu/handouts/linear-backprop.pdf">Backpropagation for a Linear Layer</a> by Justin Johnson</li> </ul> </li> <li>CNN, RNN/LSTM/GRU</li> <li>Regularization in NN, dropout, batch normalization</li> </ul> </li> </ul> <h3 id="coding-essentials">Coding essentials</h3> <p>The bare minimum of coding concepts you need to know well.</p> <ul> <li>Data structures: <ul> <li>array, dict, linked list, tree, heap, graph, ways of representing sparse matrices</li> </ul> </li> <li>Sorting algorithms: <ul> <li>see <a
href="https://brilliant.org/wiki/sorting-algorithms/">this</a> from brilliant.org</li> </ul> </li> <li>Tree/Graph related algorithms <ul> <li>Traversal (BFS, DFS)</li> <li>Shortest path (two-sided BFS, Dijkstra)</li> </ul> </li> <li>Recursion and dynamic programming</li> </ul> <h3 id="calculus">Calculus</h3> <p>Just to spell things out</p> <ul> <li>Derivatives <ul> <li>Product rule, chain rule, power rule, L’Hospital’s rule</li> <li>Partial and total derivatives</li> <li>Things worth remembering <ul> <li>common functions’ derivatives</li> <li>limits and approximations</li> </ul> </li> <li>Applications of derivatives: e.g. <a href="https://math.stackexchange.com/questions/1619911/why-ex-is-always-greater-than-xe">this</a></li> </ul> </li> <li>Integration <ul> <li>Power rule, integration by substitution, integration by parts</li> <li>Change of coordinates</li> </ul> </li> <li>Taylor expansion <ul> <li>Single and multiple variables</li> <li>Taylor/Maclaurin series for common functions</li> <li>Derive Newton-Raphson</li> </ul> </li> <li>ODEs, PDEs (common ways to solve them analytically)</li> </ul> <h3 id="linear-algebra">Linear algebra</h3> <ul> <li>Vector and matrix multiplication</li> <li>Matrix operations (transpose, determinant, inverse etc)</li> <li>Types of matrices (symmetric, Hermitian, orthogonal etc) and their properties</li> <li>Eigenvalues and eigenvectors</li> <li>Matrix calculus (gradients, Hessian etc)</li> <li>Useful theorems</li> <li>Matrix decomposition</li> <li>Concrete applications in ML and optimization</li> </ul> <h3 id="probability">Probability</h3> <p>Solving probability interview questions is really all about pattern recognition.
To do well, do plenty of exercises from <a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1466575573/ref=pd_lpo_sbs_14_t_2?_encoding=UTF8&amp;psc=1&amp;refRID=5W11QQ7WW4DFE0Q89N7V">this</a> and <a href="https://www.amazon.com/Practical-Guide-Quantitative-Finance-Interviews/dp/1438236662">this</a>. This topic is particularly heavy in quant interviews and usually quite light in ML/AI/DS interviews.</p> <ul> <li>Basic concepts <ul> <li>Event, outcome, random variable, probability and probability distributions</li> </ul> </li> <li>Combinatorics <ul> <li>Permutations</li> <li>Combinations</li> <li>Inclusion-exclusion</li> </ul> </li> <li>Conditional probability <ul> <li>Bayes rule</li> <li>Law of total probability</li> </ul> </li> <li>Probability Distributions <ul> <li>Expectation and variance equations</li> <li>Discrete probability and stories</li> <li>Continuous probability: uniform, Gaussian, Poisson</li> </ul> </li> <li>Expectations, variance, and covariance <ul> <li>Linearity of expectation <ul> <li>solving problems with this theorem and symmetry</li> </ul> </li> <li>Law of total expectation</li> <li>Covariance and correlation</li> <li>Independence implies zero correlation</li> <li>Hash collision probability</li> </ul> </li> <li>Universality of the Uniform distribution <ul> <li>Proof</li> <li>Circle problem</li> </ul> </li> <li>Order statistics <ul> <li>Expectation of the min and max of random variables</li> </ul> </li> <li>Graph-based solutions involving multiple random variables <ul> <li>e.g.
breaking sticks, meeting at the train station, frog jump (simplex)</li> </ul> </li> <li>Approximation method: Central Limit Theorem <ul> <li>Definition, examples (unfair coins, Monte Carlo integration)</li> <li><a href="https://github.com/dingran/quant-notes/blob/master/prob/central_limit_theorem.ipynb">Example question</a></li> </ul> </li> <li>Approximation method: Poisson Paradigm <ul> <li>Definition, examples (duplicated draw, near-birthday problem)</li> </ul> </li> <li>Poisson count/time duality <ul> <li>Poisson from Poissons</li> </ul> </li> <li>Markov chain tricks <ul> <li>Various games, introduction of martingales</li> </ul> </li> </ul> <h3 id="statistics">Statistics</h3> <ul> <li>Z-score, p-value</li> <li>t-test, F-test, Chi2 test (know when to use which)</li> <li>Sampling methods</li> <li>AIC, BIC</li> </ul> <h3 id="optional-numerical-methods-and-optimization">[Optional] Numerical methods and optimization</h3> <ul> <li>Computer errors (e.g. float)</li> <li>Basic root finding (Newton’s method, bisection, secant etc)</li> <li>Interpolation</li> <li>Numerical integration and differentiation</li> <li>Numerical linear algebra <ul> <li>Solving linear equations, direct methods (understand the complexities here) and iterative methods (e.g. conjugate gradient), maybe BFGS</li> <li>Matrix decompositions/transformations (e.g. QR, Givens, LU, SVD etc)</li> <li>Eigenvalue solvers (e.g.
power iteration, Arnoldi/Lanczos etc)</li> </ul> </li> <li>ODE solvers (explicit, implicit)</li> <li>Finite-difference method, finite-element method</li> <li>Optimization topics: linear programming (and convex optimization in general), calculus of variations</li> </ul>Ran DingA short list of resources and topics covering the essential quantitative tools for data scientists, AI/machine learning practitioners, quant developers/researchers and those who are preparing to interview for these roles.Recent Progress in Neural Variational Inference2018-03-08T00:00:00+00:002018-03-08T00:00:00+00:00https://dingran.github.io/NVI<iframe src="https://drive.google.com/file/d/1PbUU94Cf6EsR9AxND_WbREHq3q3jrt1c/preview?usp=sharing" width="100%" height="600"></iframe>Ran DingA literature survey of recent papers on Neural Variational Inference (NVI) and its application in topic modeling.Brief Survey of Generative Models2017-12-20T00:00:00+00:002017-12-20T00:00:00+00:00https://dingran.github.io/GM<script type="math/tex; mode=display">\newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}}</script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> <h2 id="overview">Overview</h2> <p>This page is a high-level summary of various generative models, with little explanation. Models to cover are as follows:</p> <p><strong>Variational Autoencoders (VAE)</strong></p> <ul> <li>Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).</li> </ul> <p><strong>Adversarial Variational Bayes (AVB)</strong></p> <p>Extension to VAE to use non-Gaussian encoders</p> <ul> <li>Mescheder, Lars, Sebastian Nowozin, and Andreas Geiger.
“Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks.” arXiv preprint arXiv:1701.04722 (2017).</li> </ul> <p><strong>Generative Adversarial Networks (GAN)</strong></p> <ul> <li>Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.</li> </ul> <p><strong>Generalized divergence minimization GAN (<script type="math/tex">f</script>-GAN)</strong></p> <ul> <li>Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. “f-gan: Training generative neural samplers using variational divergence minimization.” Advances in Neural Information Processing Systems. 2016.</li> </ul> <p><strong>Wasserstein GAN (WGAN)</strong></p> <ul> <li>Arjovsky, Martin, Soumith Chintala, and Léon Bottou. “Wasserstein gan.” arXiv preprint arXiv:1701.07875 (2017).</li> </ul> <p><strong>Adversarial Autoencoders (AAE)</strong></p> <ul> <li>Makhzani, Alireza, et al. “Adversarial autoencoders.” arXiv preprint arXiv:1511.05644 (2015).</li> </ul> <p><strong>Wasserstein Auto-Encoder (WAE)</strong></p> <ul> <li>Tolstikhin, Ilya, et al. “Wasserstein Auto-Encoders.” arXiv preprint arXiv:1711.01558 (2017).</li> </ul> <p><strong>Cramer GAN</strong></p> <ul> <li>Bellemare, Marc G., et al.
“The Cramer Distance as a Solution to Biased Wasserstein Gradients.” arXiv preprint arXiv:1705.10743 (2017).</li> </ul> <hr /> <h2 id="vae">VAE</h2> <h3 id="model-setup">Model setup:</h3> <ul> <li>Recognition model: <script type="math/tex">q_\phi(z \vert x) = \mathcal N(\mu=h_1(x), \sigma^2 \mathbf I=h_2(x)\mathbf I)</script></li> <li>Assumed fixed prior: <script type="math/tex">p(z) = \mathcal N(0,\mathbf I)</script></li> <li>Generation model: <script type="math/tex">p_\theta(x \vert z) = \mathcal N(\mu=g_1(z), \sigma^2 \mathbf I=g_2(z)\mathbf I)</script> <ul> <li>Implied (but intractable) posterior: <script type="math/tex">p_\theta(z\vert x)</script></li> </ul> </li> </ul> <h3 id="key-equations">Key equations:</h3> <script type="math/tex; mode=display">\begin{equation} \begin{split} \log p_\theta (x^i) = D_{KL}(q_\phi(z\vert x^i)\| p_\theta(z\vert x^i)) + \mathcal L(\theta, \phi, x^i) \end{split} \end{equation}</script> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} \mathcal L(\theta, \phi, x^i) &= \mathbb{E}_{z\sim q_\phi(z\vert x^i)}[\log p_\theta(x^i,z) - \log q_\phi(z\vert x^i)]\\\\ &= \mathbb{E}_{z\sim q_\phi(z\vert x^i)}[\log p_\theta(x^i,z)] + H[q_\phi(z\vert x^i)]\\\\ &= \mathbb{E}_{z\sim q_\phi(z\vert x^i)}[\log p_\theta(x^i \vert z)] - D_{KL}[q_\phi(z\vert x^i) \| p(z)]\\\\ \end{split} \label{vae_elbo} \end{equation} %]]></script> <h3 id="optimization-objective">Optimization objective:</h3> <script type="math/tex; mode=display">\hat\theta, \hat\phi = \argmax_{\theta, \phi} \sum_i \mathcal L(\theta, \phi, x^i)</script> <h3 id="gradient-friendly-monte-carlo">Gradient-friendly Monte Carlo:</h3> <p>Difficulties in calculating <script type="math/tex">\mathcal L(\theta, \phi, x^i)</script>:</p> <ul> <li>Due to the generality of <script type="math/tex">q</script> and <script type="math/tex">p</script> (typically a neural network), the expectation in <script type="math/tex">\ref{vae_elbo}</script> does not have an analytical
form. So we need to resort to Monte Carlo estimation.</li> <li>Furthermore, directly sampling <script type="math/tex">z</script> according to <script type="math/tex">q</script> poses difficulty in taking derivatives with respect to the parameters <script type="math/tex">\phi</script> that parameterize the distribution <script type="math/tex">q</script>.</li> </ul> <p>Solution: Reparameterization Trick</p> <p>Find a smooth and invertible transformation <script type="math/tex">z=g_\phi(\epsilon)</script> such that with <script type="math/tex">\epsilon</script> drawn from a <em>fixed</em> (non-parameterized) distribution <script type="math/tex">p(\epsilon)</script> we have <script type="math/tex">z \sim q(z; \phi)</script>, so</p> <script type="math/tex; mode=display">\mathbb{E}_{z\sim q(z;\phi)}[f(z)] = \mathbb{E}_{\epsilon\sim p(\epsilon)}[f(g_\phi(\epsilon))]</script> <p>For the Normal distribution used here (<script type="math/tex">q_\phi(z\vert x)</script>), it is convenient to use the location-scale transformation, <script type="math/tex">z=\mu+\sigma * \epsilon</script> with <script type="math/tex">\epsilon \sim \mathcal N(0,\mathbf I)</script>.</p> <script type="math/tex; mode=display">\begin{equation} \widetilde{\mathcal{L}}(\theta, \phi, x^i) = \frac{1}{L} \sum_{l=1}^L \log p_\theta(x^i \vert z^{i,l}) - D_{KL}[q_\phi(z\vert x^i) \| p(z)] \end{equation}</script> <script type="math/tex; mode=display">z^{i,l} = \mu_{x^i} + \sigma_{x^i} * \epsilon^{i,l} ~~\text{and}~~ \epsilon^{i,l} \sim \mathcal N(0,\mathbf I)</script> <p>For total <script type="math/tex">N</script> data points with mini batch size <script type="math/tex">M</script>:</p> <script type="math/tex; mode=display">\begin{equation} \begin{split} {\mathcal L}(\theta, \phi; X) = \sum_{i=1}^N \mathcal L(\theta, \phi, x^i) \approx \widetilde {\mathcal L^M}(\theta, \phi; X) = \frac{N}{M} \sum_{i=1}^M \widetilde {\mathcal L}(\theta, \phi, x^i) \end{split} \end{equation}</script> <p>For sufficiently
large batch size <script type="math/tex">M</script>, the inner loop sample size <script type="math/tex">L</script> can be set to 1. Due to stochastic mini batch gradient descent and stochastic expectation estimation, this is also called <em>doubly stochastic estimation</em>.</p> <h3 id="using-non-gaussian-encoders">Using non-Gaussian encoders</h3> <blockquote> <p>Todo: discuss AVB paper</p> </blockquote> <h3 id="gumble-trick-for-discrete-latent-variables">Gumbel trick for discrete latent variables</h3> <p>Refs for this section:</p> <ol> <li>Gumbel max trick <a href="https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/">https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/</a></li> <li>Balog, Matej, et al. “Lost Relatives of the Gumbel Trick.” arXiv preprint arXiv:1706.04161 (2017).</li> <li>Jang, Eric, Shixiang Gu, and Ben Poole. “Categorical reparameterization with gumbel-softmax.” arXiv preprint arXiv:1611.01144 (2016).</li> </ol> <p>Gumbel distribution:</p> <hr /> <h2 id="f-gan-and-gan"><script type="math/tex">f</script>-GAN and GAN</h2> <h3 id="prelude-on-f-divergence-and-its-variational-lower-bound">Prelude on <script type="math/tex">f</script>-divergence and its variational lower bound</h3> <p>The f-divergence family is defined as</p> <script type="math/tex; mode=display">\begin{equation} D_f(P \| Q) = \int_{\mathcal X} q(x) ~ f\left( \frac{p(x)}{q(x)}\right) dx \label{f_div} \end{equation}</script> <p>where the <em>generator function</em> <script type="math/tex">f: \mathbb{R}_{+} \rightarrow \mathbb{R}</script> is a convex, lower-semicontinuous function satisfying <script type="math/tex">f(1) = 0</script>.</p> <p>Every convex, lower-semicontinuous function has a <em>convex conjugate</em> function <script type="math/tex">f^c</script>, also known as the <em>Fenchel conjugate</em>.
This function is defined as</p> <script type="math/tex; mode=display">\begin{equation} f^c(t) = \underset {u \in \text{dom}_f}{\text{sup}} \{ut - f(u)\} \end{equation}</script> <p>The function <script type="math/tex">f^c</script> is again convex and lower-semicontinuous, and the pair <script type="math/tex">(f,f^c)</script> is dual, i.e. <script type="math/tex">\left(f^{c}\right)^c=f</script>. So we can represent <script type="math/tex">f</script> as</p> <script type="math/tex; mode=display">\begin{equation} f(u) = \underset {t \in \text{dom}_{f^c}}{\text{sup}} \{tu - f^c(t)\} \end{equation}</script> <p>With this we can establish a lower bound for estimating the f-divergence in general:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} D_f(P \| Q) &= \int_{\mathcal X} q(x) \underset {t \in \text{dom}_{f^c}}{\text{sup}} \{t \frac{p(x)}{q(x)} - f^c(t)\} dx \\\\ & \ge \underset {T \in {\mathcal T}} {\text{sup}} \int_{\mathcal X} \left( p(x)T(x) - q(x)f^c(T(x)) \right) dx \\\\ & = \underset {T \in {\mathcal T}} {\text{sup}} \left( \mathbb{E}_{x\sim P}[T(x)] - \mathbb{E}_{x\sim Q}[f^c(T(x))] \right) \end{split} \label{f_lowerbound} \end{equation} %]]></script> <p>where <script type="math/tex">\mathcal T</script> is an arbitrary class of functions <script type="math/tex">T: \mathcal X \rightarrow \mathbb R</script>. The inequality is due to Jensen’s inequality and the constraints imposed by <script type="math/tex">\mathcal T</script>.</p> <p>The bound is tight for</p> <script type="math/tex; mode=display">\begin{equation} T^*(x) = f' \left(\frac{p(x)}{q(x)} \right) \end{equation}</script> <h3 id="generative-adversarial-training">Generative adversarial training</h3> <p>Suppose our goal is to come up with a distribution <script type="math/tex">Q</script> (model) that is close to <script type="math/tex">P</script> (the data distribution) and the similarity score (loss) is measured by <script type="math/tex">D_f(P \| Q)</script>.
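</p>

<p>As a numeric sanity check of the conjugacy machinery above, the sketch below recovers the conjugate of the forward-KL generator <script type="math/tex">f(u) = u\log u</script> (whose closed form <script type="math/tex">f^c(t) = \exp(t-1)</script> appears in the tables later in this section) by brute-force maximization over a grid. The grid bounds and tolerance are arbitrary choices for illustration.</p>

```python
import numpy as np

# Forward-KL generator f(u) = u*log(u); its closed-form conjugate is exp(t-1).
def f(u):
    return u * np.log(u)

def conjugate_numeric(t, grid=np.linspace(1e-6, 5.0, 500_000)):
    """Approximate f^c(t) = sup_u {u*t - f(u)} on a dense grid."""
    return np.max(t * grid - f(grid))

for t in [-1.0, 0.0, 0.5, 1.0]:
    assert abs(conjugate_numeric(t) - np.exp(t - 1)) < 1e-3
```

<p>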
However the direct calculation of <script type="math/tex">\ref{f_div}</script> is intractable, for example when the functional form of <script type="math/tex">P</script> is unknown and <script type="math/tex">Q</script> is a complex model parameterized by a neural network.</p> <p>To be specific:</p> <ul> <li>Evaluating <script type="math/tex">q(x)</script> at any <script type="math/tex">x</script> is easy, but integrating it is hard due to the lack of a simple functional form.</li> <li>For <script type="math/tex">p(x)</script>, we do not know how to evaluate it at any <script type="math/tex">x</script>.</li> <li>Sampling from both <script type="math/tex">P</script> and <script type="math/tex">Q</script> is easy: drawing from the data set approximates <script type="math/tex">x \sim P</script>, and the model <script type="math/tex">Q</script> can take random vectors as input, which are easy to produce.</li> </ul> <p>Since we can sample from both distributions easily, <script type="math/tex">\ref{f_lowerbound}</script> offers a way to estimate a lower bound of the divergence. We maximize this lower bound over <script type="math/tex">T</script> so that it approaches the true divergence, then minimize it over <script type="math/tex">Q</script>.
This is formally stated as follows.</p> <script type="math/tex; mode=display">\begin{equation} F(\theta, \omega) = \mathbb{E}_{x\sim P}[T_\omega(x)] + \mathbb{E}_{x\sim Q_\theta}[-f^c(T_\omega(x))] \end{equation}</script> <script type="math/tex; mode=display">\begin{equation} \hat \theta = \argmin_\theta \max_\omega F(\theta, \omega) \end{equation}</script> <p>To ensure that the output of <script type="math/tex">T_\omega</script> respects the domain of <script type="math/tex">{f^c}</script>, we define <script type="math/tex">T_\omega(x) = g_f(V_\omega(x))</script>, where <script type="math/tex">V_\omega: \mathcal X \rightarrow \mathbb R</script> has no range constraint on its output and <script type="math/tex">g_f: \mathbb R \rightarrow \text{dom}_{f^c}</script> is an output activation function, specific to the <script type="math/tex">f</script>-divergence used, with a suitable output range.</p> <h3 id="gan">GAN</h3> <p>For the original GAN, the objective corresponds to a divergence similar to Jensen-Shannon: <script type="math/tex">\begin{equation} F(\theta, \omega) = \mathbb{E}_{x\sim P}[\log D_\omega(x)] + \mathbb{E}_{x\sim Q_\theta}[\log(1-D_\omega(x))] \end{equation}</script> with <script type="math/tex">D_\omega(x) = 1/(1+e^{-V_\omega(x)})</script>, which corresponds to the following:</p> <p><script type="math/tex">g_f(\nu)= \log(1/(1+e^{-\nu}))</script> <script type="math/tex">T_\omega(x) = \log (D_\omega(x)) = g_f(V_\omega(x))</script> <script type="math/tex">f^c(t) = -\log (1-\exp(t))</script> <script type="math/tex">\log (1-D_\omega(x)) = -f^c(T_\omega(x))</script></p> <h3 id="practical-considerations-in-adversarial-training">Practical considerations in adversarial training</h3> <blockquote> <p>Todo: log trick, DCGAN heuristics</p> </blockquote> <h3 id="example-divergence-and-their-related-functions">Example divergences and their related functions</h3> <table> <thead> <tr> <th>Name</th> <th><script type="math/tex">D_f(P\| Q)</script></th> <th>Generator <script
type="math/tex">f(u)</script></th> <th><script type="math/tex">T^*(x)</script></th> </tr> </thead> <tbody> <tr> <td>Forward KL</td> <td><script type="math/tex">\int p(x) \log \frac{p(x)}{q(x)} dx</script></td> <td><script type="math/tex">u\log u</script></td> <td><script type="math/tex">1 +\log \frac{p(x)}{q(x)}</script></td> </tr> <tr> <td>Reverse KL</td> <td><script type="math/tex">\int q(x) \log \frac{q(x)}{p(x)} dx</script></td> <td><script type="math/tex">-\log u</script></td> <td><script type="math/tex">- \frac{q(x)}{p(x)}</script></td> </tr> <tr> <td>Jensen-Shannon</td> <td><script type="math/tex">\frac{1}{2} \int p(x) \log \frac{2p(x)}{p(x)+q(x)} + q(x) \log \frac{2q(x)}{p(x)+q(x)} dx</script></td> <td><script type="math/tex">u\log u - (u+1) \log \frac{u+1}{2}</script></td> <td><script type="math/tex">\log \frac{2p(x)}{p(x)+q(x)}</script></td> </tr> <tr> <td>GAN</td> <td><script type="math/tex">\int p(x) \log \frac{2p(x)}{p(x)+q(x)} + q(x) \log \frac{2q(x)}{p(x)+q(x)} dx -\log(4)</script></td> <td><script type="math/tex">u\log u - (u+1) \log (u+1)</script></td> <td><script type="math/tex">\log \frac{p(x)}{p(x)+q(x)}</script></td> </tr> </tbody> </table> <table> <thead> <tr> <th>Name</th> <th>Conjugate <script type="math/tex">f^c(t)</script></th> <th><script type="math/tex">\text{dom}_{f^c}</script></th> <th>Output activation <script type="math/tex">g_f</script></th> <th><script type="math/tex">f'(1)</script></th> </tr> </thead> <tbody> <tr> <td>Forward KL</td> <td><script type="math/tex">\exp(t-1)</script></td> <td><script type="math/tex">\mathbb R</script></td> <td><script type="math/tex">\nu</script></td> <td><script type="math/tex">1</script></td> </tr> <tr> <td>Reverse KL</td> <td><script type="math/tex">-1-\log(-t)</script></td> <td><script type="math/tex">\mathbb R_{-}</script></td> <td><script type="math/tex">-\exp(\nu)</script></td> <td><script type="math/tex">-1</script></td> </tr> <tr> <td>Jensen-Shannon</td> <td><script
type="math/tex">-\log(2-\exp(t))</script></td> <td><script type="math/tex">% <![CDATA[ t < \log(2) %]]></script></td> <td><script type="math/tex">\log(2) - \log(1+\exp(-\nu))</script></td> <td><script type="math/tex">0</script></td> </tr> <tr> <td>GAN</td> <td><script type="math/tex">-\log(1-\exp(t))</script></td> <td><script type="math/tex">\mathbb R_{-}</script></td> <td><script type="math/tex">- \log(1+\exp(-\nu))</script></td> <td><script type="math/tex">-\log(2)</script></td> </tr> </tbody> </table> <hr /> <h2 id="wgan-and-wae">WGAN and WAE</h2> <h3 id="optimal-transport-ot">Optimal transport (OT)</h3> <p>Kantorovich formulated the optimization target in optimal transport problems as follows</p> <script type="math/tex; mode=display">\begin{equation} W_c(P_X, P_G) = \underset{\Gamma \in \mathcal P(x \sim P_X, y \sim P_G)}{\text{inf}} \mathbb{E}_{x,y \sim \Gamma}[c(x,y)] \end{equation}</script> <p>where <script type="math/tex">\mathcal P(X\sim P_X, Y\sim P_G)</script> is the set of all joint distributions of <script type="math/tex">(X,Y)</script> with marginals <script type="math/tex">P_X</script> and <script type="math/tex">P_G</script>.</p> <h3 id="wasserstein-distance">Wasserstein distance</h3> <p>When <script type="math/tex">c(x,y) = \| x-y \| ^p</script> for <script type="math/tex">p \ge 1</script>, <script type="math/tex">W_c^{1/p}</script> is called the p-Wasserstein distance.</p> <script type="math/tex; mode=display">\begin{equation} W_p(P_X, P_G) = \left( \underset{\Gamma \in \mathcal P(x \sim P_X, y \sim P_G)}{\text{inf}} \mathbb{E}_{x,y \sim \Gamma}[\|x - y\|^p] \right)^{1/p} \end{equation}</script> <p>The optimization problem is highly intractable in general, due to the constraint.
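</p>

<p>In one dimension, however, the Wasserstein distance is easy to compute: the optimal coupling simply matches sorted samples (equivalently, it compares quantile functions), so for two equal-size empirical samples <script type="math/tex">W_1</script> is the mean absolute difference of the sorted values. A minimal numpy sketch (the sample size and distributions here are arbitrary illustrations):</p>

```python
import numpy as np

def w1_empirical(x, y):
    """W_1 between two equal-size 1-D empirical distributions:
    the optimal coupling pairs the sorted samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

# Translating a distribution by c moves it by exactly |c| in W_1.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)
assert abs(w1_empirical(x, x + 3.0) - 3.0) < 1e-9
```

<p>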
However when <script type="math/tex">p=1</script>, Kantorovich-Rubinstein duality holds:</p> <script type="math/tex; mode=display">\begin{equation} W_1(P_X, P_G) = \underset{f \in \text{\{1-Lipschitz\}}}{\text{sup}} \mathbb{E}_{x\sim P_X}[f(x)] - \mathbb{E}_{y\sim P_G}[f(y)] \end{equation}</script> <p>The <script type="math/tex">f</script>-divergence family only considers the relative probability (the ratio between the two probability density functions) and does not measure the closeness of the underlying outcomes. With disjoint supports, or overlapping supports whose intersection has zero measure, the divergence between a target distribution and a <script type="math/tex">\theta</script>-parameterized distribution might not be continuous with respect to <script type="math/tex">\theta</script>. The Wasserstein distance, on the other hand, does take into account the underlying topology of the outcomes; it is continuous and differentiable almost everywhere with respect to <script type="math/tex">\theta</script> and thus almost always provides useful gradients for optimization.</p> <h3 id="wasserstein-gan-wgan">Wasserstein GAN (WGAN)</h3> <p>Following the dual form of <script type="math/tex">W_1</script>, we can form a generative-adversarial model for a data distribution <script type="math/tex">P_D</script> and model <script type="math/tex">Q_\theta</script> with an auxiliary function <script type="math/tex">f</script> that is 1-Lipschitz continuous.</p> <script type="math/tex; mode=display">\begin{equation} \hat \theta = \argmin_\theta \underset{f \in \text{\{1-Lipschitz\}}}{\text{sup}} \mathbb{E}_{x\sim P_D}[f(x)] - \mathbb{E}_{x\sim Q_\theta}[f(x)] \end{equation}</script> <h3 id="practical-considerations-for-wgan">Practical considerations for WGAN</h3> <blockquote> <p>Todo: Gradient clipping with K-Lipschitz constraint on <script type="math/tex">f</script>; Soft gradient penalty (WGAN-GP)</p> </blockquote> <h3 id="wasserstein-auto-encoder-wae">Wasserstein Auto-encoder (WAE)</h3> <p>Rather than
working with the dual form of the Wasserstein distance, which only holds for <script type="math/tex">W_1</script>, we can also work with the primal form directly. As shown in <em>Tolstikhin, Ilya, et al. “Wasserstein Auto-Encoders.”</em> the following holds when we have a deterministic decoder mapping latent variable <script type="math/tex">Z</script> to <script type="math/tex">Y</script> through <script type="math/tex">y=G(z)</script>:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} W_c(P_X, P_G) = W_c^\dagger(P_X, P_G) &= \underset{P \in \mathcal P(x \sim P_X, z \sim P_Z)}{\text{inf}} \mathbb{E}_{x,z \sim P}[c(x, G(z))]\\\\ &= \underset{Q: Q_Z = P_Z}{\text{inf}} \mathbb{E}_{x\sim P_X} \mathbb{E}_{z \sim Q(Z\vert X)}[c(x, G(z))] \end{split} \end{equation} %]]></script> <p>The constraint put on <script type="math/tex">Q(Z\vert X)</script> is that its marginal needs to equal <script type="math/tex">P(Z)</script>. To obtain a feasible optimization problem, we relax this constraint into a constraint-free optimization target with a penalty that assesses the closeness between <script type="math/tex">Q(Z)</script> and <script type="math/tex">P(Z)</script> via any reasonable divergence.
This new objective is named <em>penalized optimal transport</em> (POT).</p> <script type="math/tex; mode=display">\begin{equation} D_{POT/WAE}(P_X, P_G) := \underset{Q \in \mathcal Q}{\text{inf}} \mathbb{E}_{x\sim P_X} \mathbb{E}_{z \sim Q(Z\vert X)}[c(x, G(z))] + \lambda \cdot D_{Z} (Q_Z, P_Z) \end{equation}</script> <p>If the divergence between <script type="math/tex">P_Z</script> and <script type="math/tex">Q_Z</script> is intractable to calculate directly, we can use generative-adversarial training to approximate it (see <script type="math/tex">f</script>-GAN).</p> <blockquote> <p>Note: if the decoder is probabilistic instead of deterministic, we would only have <script type="math/tex">W_c(P_X, P_G) \le W_c^\dagger(P_X, P_G)</script>, so we are minimizing an upper bound of the true OT cost.</p> </blockquote> <blockquote> <p>Thought: the original paper used the JS divergence for <script type="math/tex">D_Z</script>; what if we used the Wasserstein distance for <script type="math/tex">D_Z</script>?</p> </blockquote> <blockquote> <p>Todo: discuss connections to AAE paper</p> </blockquote>Ran DingA high-level summary of various generative models including Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and their notable extensions and generalizations, such as f-GAN, Adversarial Variational Bayes (AVB), Wasserstein GAN, Wasserstein Auto-Encoder (WAE), Cramer GAN, etc.EM Algorithm2017-12-15T00:00:00+00:002017-12-15T00:00:00+00:00https://dingran.github.io/EM<script type="math/tex; mode=display">\newcommand{\argmin}{\mathop{\mathrm{argmin}}} \newcommand{\argmax}{\mathop{\mathrm{argmax}}} \renewcommand{\vec}[1]{\boldsymbol{#1}}</script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> <h2 id="introduction">Introduction</h2> <p>This post explains the Expectation-Maximization (EM) algorithm from scratch in a fairly concise fashion.
The material is based on my own notes, which of course come from a variety of great resources online that are listed in the references section.</p> <p>EM is one of the most elegant and widely used machine learning algorithms but is sometimes not thoroughly introduced in introductory machine learning courses. What is so elegant about EM is that, as we shall see, it originates from nothing but the most fundamental laws of probability.</p> <p>Many variants of EM have been developed, and an important class of statistical machine learning methods called variational inference also has a strong connection to EM. The core ideas and derivatives of EM find many applications in both classical statistical machine learning and models that involve deep neural networks, making it worthwhile to have an intuitive and thorough understanding of it, which is what this post attempts to provide.</p> <h2 id="notation">Notation</h2> <!--- comment - Vector:$$\vec x$$; matrix$$\vec X. ---> <ul> <li>Random variables <script type="math/tex">X</script>, probability distribution <script type="math/tex">P(X)</script></li> <li>Probability density function (PDF) <script type="math/tex">p(\cdot)</script>, evaluated at value <script type="math/tex">x</script>: <script type="math/tex">p(X=x)</script> with <script type="math/tex">p(x)</script> as a shorthand</li> <li>PDF with parameter <script type="math/tex">\theta</script> is noted as <script type="math/tex">p_\theta(x)</script> or equivalently <script type="math/tex">p(x\vert \theta)</script></li> <li>Expectation of <script type="math/tex">f(x)</script> according to distribution <script type="math/tex">P</script>: <script type="math/tex">\mathbb{E}_{x\sim P}\left[f(x)\right]</script></li> <li>A set is noted as <script type="math/tex">\{x_i\}</script> or by a calligraphic letter <script type="math/tex">\mathcal X</script></li> </ul> <h2 id="maximum-likelihood">Maximum likelihood</h2> <p>Suppose we have data coming from a distribution <script
type="math/tex">P_D(X)</script>, and we want to come up with a model for <script type="math/tex">x</script> parameterized by <script type="math/tex">\theta</script>: <script type="math/tex">p(x;\theta)</script>, or equivalently noted as <script type="math/tex">p_{\theta}(x)</script>, to best approximate the real data distribution. Further assume all the data samples are independent and identically distributed (i.i.d.) according to <script type="math/tex">P_D(X)</script>.</p> <p>To find <script type="math/tex">\theta</script> under a maximum likelihood scheme we do</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} \hat{\theta}_{MLE} &= \argmax_{\theta} \ell(\theta) \\\\ &= \argmax_{\theta} \sum_{i} \log\left( p_{\theta}(x_i) \right) \end{split} \end{equation} %]]></script> <h2 id="motivation-for-em">Motivation for EM</h2> <p>We might encounter situations where, in addition to observed data <script type="math/tex">\{x_i\}</script>, we have missing or hidden data <script type="math/tex">\{z_i\}</script>. It might literally be data that is missing for some reason. Or, more interestingly, it might be due to our modeling choice. We might prefer to have a model with a set of meaningful but hidden variables <script type="math/tex">\{z_i\}</script> that help explain the “causes” of <script type="math/tex">\{x_i\}</script>.
Good examples of this category would be Gaussian (or other kinds of) mixture models and LDA.</p> <blockquote> <p>Note to myself: examples where we introduce latent variables just for the sake of making the optimization problem easier?</p> </blockquote> <p>In either case, we will need a model for calculating the joint distribution of <script type="math/tex">x</script> and <script type="math/tex">z</script>, <script type="math/tex">p(x,z;\theta)</script>, which may arise from assumptions (in the case of missing data) or from models of the marginal density <script type="math/tex">p(z; \theta)</script> and the conditional density <script type="math/tex">p(x\vert z; \theta)</script>. In such cases, the log likelihood can be expressed as</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} \ell(\theta) &= \sum_i \log\left( p_{\theta}(x_i) \right)\\\\ &= \sum_i \log\left( \sum_{z} p_{\theta}(x_i, Z=z) \right)\\\\ &= \sum_i \log\left( \sum_{z} p_{\theta}(x_i\vert Z=z)p_{\theta}(Z=z) \right) \end{split} \end{equation} %]]></script> <p>Direct maximization of <script type="math/tex">\ell(\theta)</script> with respect to <script type="math/tex">\theta</script> might be challenging, due to the summation over <script type="math/tex">z</script> inside the log. But the problem would be much easier if we knew the values of <script type="math/tex">z</script>: it would simply be the original maximum likelihood problem with all data available.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} \ell(\theta) &= \sum_i \log\left(p_{\theta}(x_i\vert Z=z_i)p_{\theta}(Z=z_i) \right) \\\\ &= \sum_i \log\left(p_{\theta}(x_i, z_i) \right) \end{split} \end{equation} %]]></script> <p>The collection of <script type="math/tex">(\{x_i\}, \{z_i\})</script> is called the <em>complete</em> data.
Naturally, <script type="math/tex">\{x_i\}</script> is the <em>incomplete</em> data and <script type="math/tex">\{z_i\}</script> is the <em>latent</em> data/variable.</p> <p>Roughly speaking, the EM algorithm is an iterative method that lets us guess <script type="math/tex">z_i</script> based on <script type="math/tex">x_i</script> (and the current estimate of the model parameter <script type="math/tex">\hat\theta</script>). With the guessed “fill-in” <script type="math/tex">z_i</script> we now have <em>complete</em> data, and we optimize the log likelihood <script type="math/tex">\ell(\theta)</script> with respect to <script type="math/tex">\theta</script>. We thus iteratively improve our guesses of the latent variable <script type="math/tex">z</script> and the parameter <script type="math/tex">\theta</script>, repeating this process until convergence.</p> <p>In slightly more detail, instead of guessing a single value of <script type="math/tex">z</script> we guess the distribution of <script type="math/tex">z</script> given <script type="math/tex">x</script>, i.e. <script type="math/tex">p(z\vert x;\hat\theta)</script>, then optimize the expected log likelihood of the <em>complete</em> data, i.e. <script type="math/tex">\sum_i \mathbb{E}_{z \sim p(z\vert x_i;\hat\theta)}\log p_\theta (x_i, z)</script>, with respect to <script type="math/tex">\theta</script>; this serves as a proxy (lower bound) for the true objective <script type="math/tex">\sum_i \log p_{\theta}(x_i)</script>. We repeat until convergence.</p> <p>(Note that guessing a single value for <script type="math/tex">z</script> is also a valid strategy. It corresponds to a variant of EM and is what we do in the well-known K-means algorithm, where we guess a “hard” label for each data point.)</p> <p>The nice thing about EM is that it comes with a theoretical guarantee of monotonic improvement on the true objective even though we work directly with a proxy (lower bound) of it.
Note, however, that the rate of convergence depends on the problem, and convergence is not guaranteed to be toward a global optimum.</p> <h2 id="formulation">Formulation</h2> <p>As before, we start with the log likelihood</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} \ell(\theta) &= \sum_i \log\left( p_{\theta}(x_i) \right) \\\\ &= \sum_i \log\left( \int p_{\theta}(x_i, z) dz \right)\\\\ &= \sum_i \log\left( \int \frac{p_{\theta}(x_i, z)}{q(z)} q(z) dz \right) \\\\ &= \sum_i \log\left( \mathbb{E}_{z \sim Q} \left[ \frac {p_{\theta}(x_i, z)}{q(z)} \right] \right)\\\\ &\ge \sum_i \mathbb{E}_{z \sim Q} \left[\log\left( \frac {p_{\theta}(x_i,z)}{q(z)} \right) \right]\\\\ \label{eq:jensen} \end{split} \end{equation} %]]></script> <p>Here I switched the summation over <script type="math/tex">z</script> to an integral, assuming <script type="math/tex">z</script> is continuous, just to hint that this is a possibility. The last step uses Jensen’s inequality and the fact that the log function is strictly concave. So far we have no restrictions on the distribution <script type="math/tex">Q</script>, apart from <script type="math/tex">q(z)</script> being a probability density function that is positive wherever <script type="math/tex">p_\theta(x_i,z)</script> is.</p> <p>Using the result above, let’s define the last quantity as <script type="math/tex">\mathcal L(q,\theta)</script>.
It is usually called the ELBO (Evidence Lower BOund) as it is a lower bound of <script type="math/tex">\ell(\theta)</script>.</p> <script type="math/tex; mode=display">\begin{equation} \mathcal L(q,\theta) = \sum_i \mathbb{E}_{z \sim Q} \left[\log\left( \frac {p_{\theta}(x_i,z)}{q(z)} \right) \right] \end{equation}</script> <p>Just to reiterate what we have done so far: our goal is to maximize <script type="math/tex">\ell(\theta)</script>; we exchanged the order of the log and the integral over <script type="math/tex">z</script> and got a lower bound <script type="math/tex">\mathcal L</script>.</p> <p>We can show that the difference between <script type="math/tex">\ell(\theta)</script> and <script type="math/tex">\mathcal L(q,\theta)</script> is</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} \ell(\theta) - \mathcal L(q,\theta) & = \sum_i \int q(z) \left(\log(p_\theta(x_i)) - \log\left(\frac{p_\theta(x_i,z)}{q(z)}\right)\right) dz\\\\ &= \sum_i \int q(z) \log\left(\frac{q(z)}{\frac{p_\theta(x_i,z)}{p_\theta(x_i)}}\right) dz \\\\ &= \sum_i \int q(z) \log\left(\frac{q(z)}{p_\theta(z\vert x_i)}\right) dz \\\\ &= \sum_i D_{KL}(q(z) \| p_\theta(z\vert x_i)) \end{split} \end{equation} %]]></script> <p>where we used the fact that the Kullback-Leibler (KL) divergence <script type="math/tex">D_{KL}</script> is defined as</p> <script type="math/tex; mode=display">D_{KL}(P \| Q)= \int p(x) \log \left( \frac{p(x)}{q(x)} \right) dx = \mathbb{E}_{x\sim P}\left[\log\left(\frac{p(x)}{q(x)}\right)\right]</script> <p>In general, the KL divergence is always nonnegative and is zero if and only if <script type="math/tex">q(x) = p(x)</script>. So in our case, the equality <script type="math/tex">\ell(\theta) = \mathcal L(q,\theta)</script> holds if and only if <script type="math/tex">q(z) = p_\theta(z\vert x_i)</script>. When this happens, we say the bound is tight.
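</p>

<p>The decomposition just derived, <script type="math/tex">\ell(\theta) = \mathcal L(q,\theta) + D_{KL}(q \| p_\theta(z\vert x))</script>, is easy to verify numerically in a discrete toy model. The sketch below uses an arbitrary 3-state joint distribution for a single observation <script type="math/tex">x</script> (the numbers are illustrative, not from this post):</p>

```python
import numpy as np

p_xz = np.array([0.10, 0.25, 0.15])  # joint p(x, z) at the observed x, for each z
q = np.array([0.50, 0.30, 0.20])     # an arbitrary valid q(z)

log_px = np.log(p_xz.sum())           # log evidence log p(x)
elbo = np.sum(q * np.log(p_xz / q))   # E_q[log p(x,z) - log q(z)]
post = p_xz / p_xz.sum()              # exact posterior p(z | x)
kl = np.sum(q * np.log(q / post))     # D_KL(q || p(z|x))

# log p(x) = ELBO + KL, and the bound is tight at q = posterior.
assert abs(log_px - (elbo + kl)) < 1e-12
assert abs(log_px - np.sum(post * np.log(p_xz / post))) < 1e-12
```

<p>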
In this case, it makes sense to write <script type="math/tex">q(z)</script> as <script type="math/tex">q(z\vert x_i)</script> to make the dependence on <script type="math/tex">x_i</script> clear.</p> <h2 id="em-algorithm-and-monotonicity-guarantee">EM algorithm and monotonicity guarantee</h2> <p>The EM algorithm is remarkably simple and goes as follows.</p> <ul> <li>E-step (of the <script type="math/tex">t</script>-th iteration): <ul> <li>Let <script type="math/tex">q^t(z) = p(z \vert x_i; \hat\theta^{t-1})</script>, which is calculated as shown in Eq. <script type="math/tex">\ref{eq:E}</script></li> <li>Due to our particular choice of <script type="math/tex">q^t</script>, at the current estimate <script type="math/tex">\hat\theta^{t-1}</script> the bound is tight: <script type="math/tex">\mathcal L(q^t,\hat\theta^{t-1}) = \ell(\hat\theta^{t-1})</script></li> </ul> </li> <li>M-step <ul> <li>Maximize <script type="math/tex">\mathcal L(q^t,\theta)</script> with respect to <script type="math/tex">\theta</script>, see Eq.
<script type="math/tex">\ref{eq:M}</script></li> <li>This step improves the ELBO by finding a better <script type="math/tex">\theta</script>: <script type="math/tex">\mathcal L(q^t,\theta^t) \ge \mathcal L(q^t,\theta^{t-1})</script></li> </ul> </li> </ul> <p>The calculation in the <strong>E-step</strong> is</p> <script type="math/tex; mode=display">\begin{equation}\label{eq:E} p(z\vert x_i; \hat\theta^{t-1}) = \frac{p(x_i\vert z; \hat\theta^{t-1})p(z; \hat\theta^{t-1})}{\int p(x_i\vert z; \hat\theta^{t-1})p(z; \hat\theta^{t-1}) dz} \end{equation}</script> <p>Just to spell out the function <script type="math/tex">\mathcal L(q^t,\theta)</script> that we maximize in the <strong>M-step</strong>:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} \hat\theta^t &= \argmax_{\theta} \mathcal L(q^t,\theta) \\\\ &= \argmax_{\theta} \sum_i \mathbb{E}_{z \sim Q^t} \left[\log\left(p(x_i,z;\theta) \right) \right] \\\\ &= \argmax_{\theta} \sum_i \int p(z\vert x_i; \hat\theta^{t-1}) \log\left(p(x_i,z;\theta)\right) dz \\\\ \end{split} \label{eq:M} \end{equation} %]]></script> <p>With the preparation above, we can also easily show the theoretical guarantee of monotonic improvement in the optimization objective <script type="math/tex">\ell(\theta)</script>.</p> <script type="math/tex; mode=display">\begin{equation}\label{eq:monotone} \ell(\theta^{t-1}) \underset{E-step}{=} \mathcal L(q^t,\theta^{t-1}) \underset{M-step}{\le} \mathcal L(q^t,\theta^t) \underset{Jensen}{\le} \ell(\theta^{t}) \end{equation}</script> <h3 id="why-the-e-in-e-step">Why the “E” in E-step</h3> <p>By the way, the reason it is called the E-step is that in this step we do the calculation necessary to figure out the form of <script type="math/tex">\mathcal L(q,\theta)</script> as a function of <script type="math/tex">\theta</script>, which we then optimize in the M-step.
The form of <script type="math/tex">\mathcal L(q,\theta)</script> is the <strong>expectation</strong> of the log likelihood of the <em>complete</em> data over the estimated distribution of the latent variable <script type="math/tex">z</script>.</p> <h3 id="em-as-maximization-maximization">EM as maximization-maximization</h3> <p>Because the particular choice of <script type="math/tex">q^t(z)</script> in the E-step makes <script type="math/tex">D_{KL}(q(z) \| p_\theta(z\vert x_i))</script> vanish, the E-step can be viewed as maximizing <script type="math/tex">\mathcal L(q,\hat\theta^{t-1})</script> with respect to <script type="math/tex">q</script>, and the M-step as maximization with respect to <script type="math/tex">\theta</script>. So we are doing alternating maximization on the ELBO with respect to <script type="math/tex">q</script> and <script type="math/tex">\theta</script>.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{equation} \begin{split} & \text{E-step:}\hspace{4pt}q^t(z) = \argmax_q \mathcal L(q,\hat\theta^{t-1})\\\\ & \text{M-step:}\hspace{4pt}\hat\theta^t = \argmax_\theta \mathcal L(q^t,\theta) \end{split} \end{equation} %]]></script> <p>This maximization-maximization view offers justification for a partial E-step (when the required calculation in the exact E-step is intractable) and a partial M-step (i.e. only finding a <script type="math/tex">\theta</script> that increases the ELBO rather than maximizes it). Under this view, direct maximization of the ELBO as the goal offers a strong connection to <strong>variational inference</strong>, as will be discussed briefly below.</p> <h3 id="example-gaussian-mixture">Example: Gaussian Mixture</h3> <p>In the context of the Gaussian Mixture Model (GMM), the <script type="math/tex">z_i</script> associated with <script type="math/tex">x_i</script> takes values in <script type="math/tex">\{1,2,\dots,n_{g}\}</script>, where <script type="math/tex">{n_g}</script> is the number of Gaussians in the model.
Thus <script type="math/tex">z_i</script> indicates which Gaussian cluster the observed data point <script type="math/tex">x_i</script> belongs to. The set of parameters <script type="math/tex">\theta</script> includes those that parameterize the marginal distribution of <script type="math/tex">z</script>, <script type="math/tex">P(Z;\vec \pi)</script>, with <script type="math/tex">\vec \pi = [\pi_1, \pi_2, \dots, \pi_{n_g}]</script>, <script type="math/tex">\sum_i^{n_g} \pi_i = 1</script> and <script type="math/tex">\pi_i > 0</script>. <script type="math/tex">\theta</script> also includes those that parameterize the conditional distributions <script type="math/tex">P(X \vert Z=z_i; \mu_i, \sigma_i) \sim \mathcal N(\mu_i, \sigma_i)</script>.</p> <p>For a detailed walk-through see Andrew Ng’s CS229 lecture <a href="http://cs229.stanford.edu/notes/cs229-notes8.pdf">notes</a> and <a href="https://www.youtube.com/watch?v=ZZGTuAkF-Hw">video</a>.</p> <h2 id="variants-and-extensions-of-em">Variants and extensions of EM</h2> <h3 id="gem-and-cem">GEM and CEM</h3> <p>A popular variant of EM is one where, in Eq. <script type="math/tex">\ref{eq:M}</script>, we merely find a <script type="math/tex">\hat\theta^t</script> that increases (rather than maximizes) <script type="math/tex">\mathcal L(q^t,\theta)</script>. It is easy to see that Eq. <script type="math/tex">\ref{eq:monotone}</script> and the monotonicity guarantee still hold in this situation. This algorithm was proposed in the original EM paper and is called <em>Generalized EM (GEM)</em>.</p> <p>Another variant is the point-estimate version we mentioned earlier, where instead of having <script type="math/tex">q^t(z) = p(z\vert x_i; \hat\theta^{t-1})</script> in the E-step, we take <script type="math/tex">z</script> to be a single value, the most probable one, i.e. <script type="math/tex">\hat{z}^t=\argmax_z p(z\vert x_i; \hat\theta^{t-1})</script>, or equivalently take <script type="math/tex">q^t(z) = \delta(z-\hat{z}^t)</script>.
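To make the contrast concrete, here is a minimal sketch of the exact (soft) E-step versus this point-estimate (hard) E-step for a two-component 1-D Gaussian mixture; the function names and parameter values are made up for illustration, not taken from any particular reference implementation:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def soft_e_step(x, pis, mus, sigmas):
    """Exact E-step: responsibilities p(z | x; theta) via Bayes' rule."""
    joint = [pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas)]
    evidence = sum(joint)  # the denominator p(x; theta)
    return [j / evidence for j in joint]

def hard_e_step(x, pis, mus, sigmas):
    """Point-estimate E-step: all probability mass on the most probable z."""
    resp = soft_e_step(x, pis, mus, sigmas)
    k_hat = max(range(len(resp)), key=resp.__getitem__)
    return [1.0 if k == k_hat else 0.0 for k in range(len(resp))]

# Hypothetical mixture: equal weights, means -2 and +2, unit variances.
pis, mus, sigmas = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
print(soft_e_step(0.5, pis, mus, sigmas))  # soft responsibilities, sum to 1
print(hard_e_step(0.5, pis, mus, sigmas))  # degenerate q(z): [0.0, 1.0]
```

The hard version discards the uncertainty carried by the responsibilities, which is exactly why the point-estimate variant loses the tight-bound equality discussed next.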
In this case, the integral in <script type="math/tex">\ref{eq:M}</script> is greatly simplified, but the first equality in <script type="math/tex">\ref{eq:monotone}</script> no longer holds and we lose the theoretical guarantee. This algorithm is also called <em>Classification EM (CEM)</em>.</p> <h3 id="stochastic-em">Stochastic EM</h3> <p>As we can see in Eq. <script type="math/tex">\ref{eq:M}</script>, we need to go through all data points in order to update <script type="math/tex">\theta</script>, which can be a long process for large data sets. In much the same spirit as stochastic gradient descent, we can sample subsets of the data and run the E- and M-steps on these mini-batches. The same idea applies to the variational inference methods mentioned below, for the updates of <em>global</em> latent variables (such as <script type="math/tex">\theta</script>).</p> <h3 id="variational-inference">Variational inference</h3> <p>The computation of the optimal <script type="math/tex">q(z)</script>, i.e. <script type="math/tex">q(z) = p(z \vert x_i; \hat\theta^{t-1})</script> in the E-step, may be intractable. In particular, the integral in the denominator of Eq. <script type="math/tex">\ref{eq:E}</script> has no closed-form solution for many interesting models. In this case we can take the view of EM as maximization-maximization and try to come up with better and better <script type="math/tex">q(z)</script> to improve the ELBO. In order to proceed with such variational optimization tasks, we need to specify the functional family <script type="math/tex">\mathcal Q</script> from which we will choose <script type="math/tex">q(z)</script>. Depending on the assumptions, a number of interesting algorithms have been developed.
The most popular one is probably the <strong>mean-field approximation</strong>.</p> <p>Note that in a typical variational inference framework, the parameters <script type="math/tex">\theta</script> are treated as first-class variables that we do inference on (i.e. obtaining <script type="math/tex">p(\theta\vert x)</script>) rather than taking a single maximum-likelihood point estimate, so <script type="math/tex">\theta</script> becomes part of the latent variables and is absorbed into the notation <script type="math/tex">z</script>. Thus, <script type="math/tex">z</script> includes <em>global</em> variables such as <script type="math/tex">\theta</script> and <em>local</em> variables such as the latent labels <script type="math/tex">z_i</script> associated with each data point <script type="math/tex">x_i</script>.</p> <p>In the mean-field method, the constraint we put on <script type="math/tex">q(z)</script> is that it factorizes, i.e. <script type="math/tex">q(z) = \prod_k q_k(z_k)</script>. This says that all latent variables are mutually independent, by assumption. This seemingly simple assumption brings remarkable simplifications to the calculation of the integrals involved, especially the expectations of the log likelihood. It leads to a coordinate ascent variational inference (CAVI) algorithm that allows closed-form iterative calculation for certain model families. The coordinate updates of the <em>local</em> variables correspond to the E-step in EM, while the updates of the <em>global</em> variables correspond to the M-step.</p> <p>For more about this topic see: D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, <a href="https://arxiv.org/abs/1601.00670">“Variational Inference: A Review for Statisticians,”</a> J. Am. Stat. Assoc., vol. 112, no. 518, pp. 859–877, 2017.</p> <hr /> <h2 id="references">References</h2> <blockquote> <p>Todo: add citations in the text; for now just core dumped some references here</p> </blockquote> <p>In no particular order:</p> <ol> <li> <p>A. P.
Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Stat. Soc. Ser. B Methodol., vol. 39, no. 1, pp. 1–38, 1977.</p> </li> <li> <p>R. M. Neal and G. E. Hinton, “A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants,” Learn. Graph. Model., pp. 355–368, 1998.</p> </li> <li> <p>J. A. Bilmes, “A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models,” ReCALL, vol. 1198, no. 510, p. 126, 1998.</p> </li> <li> <p>A. Roche, “EM algorithm and variants: an informal tutorial,” pp. 1–17, 2011.</p> </li> <li> <p>M. R. Gupta, “Theory and Use of the EM Algorithm,” Found. Trends® Signal Process., vol. 4, no. 3, pp. 223–296, 2010.</p> </li> <li> <p>M. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “Introduction to variational methods for graphical models,” Mach. Learn., vol. 37, no. 2, pp. 183–233, 1999.</p> </li> <li> <p>M. J. Wainwright and M. Jordan, “Graphical Models, Exponential Families, and Variational Inference,” Found. Trends® Mach. Learn., vol. 1, no. 1–2, pp. 1–305, 2007.</p> </li> <li> <p>M. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic Variational Inference,” vol. 14, pp. 1303–1347, 2012.</p> </li> <li> <p>D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational Inference: A Review for Statisticians,” J. Am. Stat. Assoc., vol. 112, no. 518, pp. 859–877, 2017.</p> </li> <li> <p>S. Mohamed, “Variational Inference for Machine Learning,” February 2015.</p> </li> <li> <p>Z. Ghahramani, “Variational Methods: The Expectation Maximization (EM) Algorithm,”
April 2003.</p> </li> </ol>Ran DingA quick walk-through of Expectation-Maximization (EM) algorithm.Central Limit Theorem2017-11-23T00:00:00+00:002017-11-23T00:00:00+00:00https://dingran.github.io/CLT<h2 id="definition">Definition:</h2> <p>Let <script type="math/tex">X_{1}</script>, <script type="math/tex">X_{2}</script>, <script type="math/tex">X_{3}</script>,… be i.i.d. random variables from some distribution with finite mean <script type="math/tex">\mu</script> and finite variance <script type="math/tex">\sigma^{2}</script>.</p> <p>As <script type="math/tex">n \rightarrow \infty</script>, let <script type="math/tex">S=\sum_{k=1}^n X_{k}</script>, we have <script type="math/tex">S \rightarrow \mathcal{N}(n\mu, n\sigma^{2})</script> and <script type="math/tex">\frac{S-n\mu}{\sqrt{n\sigma^{2}}} \rightarrow \mathcal{N}(0,1)</script></p> <p>Equivalently, let <script type="math/tex">M=\frac{1}{n}\sum_{k=1}^n X_{k}</script>, we have <script type="math/tex">M \rightarrow \mathcal{N}(\mu,\frac{\sigma^2}{n})</script> and <script type="math/tex">\frac{M-\mu}{\sqrt{\frac{\sigma^2}{n}}} \rightarrow \mathcal{N}(0,1)</script></p> <p>Notation:</p> <ul> <li><script type="math/tex">\mathcal{N}(\mu,\sigma^2)</script> denotes the <a href="https://en.wikipedia.org/wiki/Normal_distribution">Normal distribution</a> with mean <script type="math/tex">\mu</script> and variance <script type="math/tex">\sigma^2</script>.</li> </ul> <h2 id="discussions">Discussions:</h2> <p>Naturally, the CLT appears in questions that involve the sum or average of a large number of random variables, especially when the question only asks for an approximate answer.</p> <p>Here are a few examples.</p> <p><br /> <strong><em>Example 1:</em></strong></p> <p>Suppose we have a fair coin and we flip it 400 times. What is the probability you will see 210 heads or more?</p> <hr /> <p><br /> <strong>Exact answer</strong></p> <p>Let the outcome of each coin flip be a random variable <script type="math/tex">I_{i}</script>.
Thus we are dealing with the random variable <script type="math/tex">S=\sum_{i=1}^{400}I_{i}</script>. <script type="math/tex">S</script> is the sum of a series of i.i.d. Bernoulli trials, so it follows a Binomial distribution. The exact answer is: <script type="math/tex">P(S\geq210)= \sum_{k=210}^{400}C_{400}^{k}\left(\frac{1}{2}\right)^{400}</script> which requires a program to calculate. (Actually, try implementing this; beware of roundoff errors, and compare it against the approximate answer below.)</p> <p>Notation:</p> <ul> <li><script type="math/tex">C_{n}^{k}</script> is the notation for “<a href="https://en.wikipedia.org/wiki/Binomial_coefficient">n choose k</a>”, which denotes the number of ways to choose k items from n items where order doesn’t matter.</li> </ul> <p><br /> <strong>Approximation</strong></p> <p>We can use the CLT to get an approximate answer quickly. First, recognize that for each <script type="math/tex">I_{i}</script> we have <script type="math/tex">\mu=0.5</script> and <script type="math/tex">\sigma^2=0.5\times(1-0.5)=0.25</script>. Then, <script type="math/tex">Z=\frac{S-400\times 0.5}{\sqrt{400\times 0.25}} = \frac{S-200}{10}</script> is approximately <script type="math/tex">\mathcal{N}(0,1)</script>.
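As a quick numerical check (a sketch added for illustration, not part of the original exercise), the exact Binomial tail and the normal approximation can both be computed with the standard library:

```python
import math

# Exact answer: P(S >= 210) for S ~ Binomial(400, 1/2).
exact = sum(math.comb(400, k) for k in range(210, 401)) / 2 ** 400

# CLT approximation: P(Z >= 1) for Z ~ N(0, 1), using the Gaussian tail
# identity P(Z >= z) = erfc(z / sqrt(2)) / 2.
approx = math.erfc(1.0 / math.sqrt(2.0)) / 2.0

print(round(exact, 4))   # about 0.17
print(round(approx, 4))  # about 0.16, close to the 68-95-99.7 estimate
```

The small gap between the two comes mostly from the continuity correction the crude approximation ignores.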
For <script type="math/tex">S \geq 210</script>, we have <script type="math/tex">Z\geq1</script>.</p> <p>The 68-95-99.7 rule tells us that for a standard Normal distribution <script type="math/tex">\mathcal{N}(0,1)</script>, the probability of the random variable taking a value more than 1 standard deviation away from the center is <script type="math/tex">1-0.68=0.32</script>, and thus the one-sided probability is <script type="math/tex">P(Z\geq1) = 0.32/2 = 0.16</script>.</p> <p><br /> <strong><em>Example 2:</em></strong></p> <p>Suppose you use Monte Carlo simulation to estimate the numerical value of <script type="math/tex">\pi</script>.</p> <ul> <li>How would you implement it?</li> <li>If we require an error of 0.001, how many trials do you need?</li> </ul> <hr /> <p><strong>Solution</strong></p> <p>One possible implementation is to start with a rectangle, say <script type="math/tex">x \in [-1,1], y\in[-1,1]</script>. If we draw a point uniformly at random from this rectangle, the probability <script type="math/tex">p</script> of the point falling into the circle region <script type="math/tex">x^2+y^2\lt1</script> is the ratio of the areas of the circle and the rectangle, i.e. <script type="math/tex">p=\frac{\pi}{4}</script>.</p> <p>Formally, let the indicator random variable <script type="math/tex">I</script> take the value 1 if the point falls in the circle and 0 otherwise; then <script type="math/tex">P(I=1)=p</script> and <script type="math/tex">E(I)=p</script>. If we do <script type="math/tex">n</script> such trials and define <script type="math/tex">M=\frac{1}{n}\sum_{k=1}^n I_{k}</script>, then <script type="math/tex">M</script> follows approximately <script type="math/tex">\mathcal{N}(\mu_{I},\frac{\sigma_{I}^2}{n})</script>.
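One way to sketch this implementation in Python (the trial count and seed below are arbitrary choices for illustration, not values from the text):

```python
import random

def estimate_pi(n, seed=42):
    """Monte Carlo estimate of pi: draw n uniform points in the square
    [-1, 1] x [-1, 1] and count the fraction landing inside the unit circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y < 1.0:
            hits += 1
    return 4.0 * hits / n  # 4 * M, since E[M] = p = pi / 4

print(estimate_pi(200_000))  # close to 3.14159
```

With 200,000 trials the standard deviation of the estimate is about 4 × sqrt(p(1 − p)/n) ≈ 0.004, consistent with the error analysis that follows.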
In this setup, <script type="math/tex">\mu_{I}=E(I)=p</script> and <script type="math/tex">\sigma_{I}^2=p(1-p)</script> (see the <a href="prob-dist-discrete.ipynb">Probability Distribution</a> section for details on <script type="math/tex">\sigma_{I}^2</script>).</p> <p>One thing we need to clarify with the interviewer is what the error really means. She might tell you to consider it as the standard deviation of the estimated <script type="math/tex">\pi</script>. Since our estimate of <script type="math/tex">\pi</script> is <script type="math/tex">4M</script>, the specified error translates into a required standard deviation of <script type="math/tex">\sigma_{req}=\frac{error}{4}</script> for the random variable <script type="math/tex">M</script>. Thus <script type="math/tex">n = \frac{\sigma_{I}^2}{\sigma_{req}^2}=\frac{p(1-p)}{(0.00025)^2}\approx2.7\times 10^6</script>.</p> <p>By the way, we can see that the number of trials <script type="math/tex">n</script> scales with <script type="math/tex">\frac{1}{error^2}</script>, which is caused by the <script type="math/tex">\frac{1}{\sqrt{n}}</script> scaling of <script type="math/tex">\sigma_{M}</script> in the CLT, and is, in general, the computational complexity entailed by <a href="https://en.wikipedia.org/wiki/Monte_Carlo_integration">Monte Carlo integration</a>.</p>Ran DingSummary and examples of CLT.