Brief Survey of Generative Models
Overview
This page is a high-level summary of various generative models, with brief explanations. The models covered are:
Variational Autoencoders (VAE)
- Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
Adversarial Variational Bayes (AVB)
Extension of VAE to non-Gaussian encoders
- Mescheder, Lars, Sebastian Nowozin, and Andreas Geiger. “Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks.” arXiv preprint arXiv:1701.04722 (2017).
Generative Adversarial Networks (GAN)
- Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.
Generalized divergence minimization GAN (f-GAN)
- Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. “f-gan: Training generative neural samplers using variational divergence minimization.” Advances in Neural Information Processing Systems. 2016.
Wasserstein GAN (WGAN)
- Arjovsky, Martin, Soumith Chintala, and Léon Bottou. “Wasserstein gan.” arXiv preprint arXiv:1701.07875 (2017).
Adversarial Autoencoders (AAE)
- Makhzani, Alireza, et al. “Adversarial autoencoders.” arXiv preprint arXiv:1511.05644 (2015).
Wasserstein Auto-Encoder (WAE)
- Tolstikhin, Ilya, et al. “Wasserstein Auto-Encoders.” arXiv preprint arXiv:1711.01558 (2017).
Cramer GAN
- Bellemare, Marc G., et al. “The Cramer Distance as a Solution to Biased Wasserstein Gradients.” arXiv preprint arXiv:1705.10743 (2017).
VAE
Model setup:
- Recognition model: $q_\phi(z|x) = \mathcal{N}(\mu = h_1(x),\ \sigma^2 I = h_2(x) I)$
- Assumed fixed prior: $p(z) = \mathcal{N}(0, I)$
- Generation model: $p_\theta(x|z) = \mathcal{N}(\mu = g_1(z),\ \sigma^2 I = g_2(z) I)$
- Implied (but intractable) posterior: $p_\theta(z|x)$
Key equations:
$$\log p_\theta(x_i) = D_{KL}\left(q_\phi(z|x_i) \,\Vert\, p_\theta(z|x_i)\right) + \mathcal{L}(\theta, \phi, x_i) \tag{1}$$

$$\begin{aligned}
\mathcal{L}(\theta, \phi, x_i) &= \mathbb{E}_{z \sim q_\phi(z|x_i)}\left[\log p_\theta(x_i, z) - \log q_\phi(z|x_i)\right] \\
&= \mathbb{E}_{z \sim q_\phi(z|x_i)}\left[\log p_\theta(x_i, z)\right] + H\left[q_\phi(z|x_i)\right] \\
&= \mathbb{E}_{z \sim q_\phi(z|x_i)}\left[\log p_\theta(x_i|z)\right] - D_{KL}\left[q_\phi(z|x_i) \,\Vert\, p(z)\right]
\end{aligned} \tag{2}$$

Optimization objective:
$$\hat\theta, \hat\phi = \arg\max_{\theta, \phi} \sum_i \mathcal{L}(\theta, \phi, x_i)$$

Gradient-friendly Monte Carlo:
Difficulties in calculating $\mathcal{L}(\theta, \phi, x_i)$:
- Due to the generality of $q$ and $p$ (typically neural networks), the expectation in (2) has no analytical form, so we need to resort to Monte Carlo estimation.
- Furthermore, directly sampling $z$ from $q$ makes it difficult to take derivatives with respect to the parameters $\phi$ of the distribution $q$.
Solution: Reparameterization Trick
Find a smooth, invertible transformation $z = g_\phi(\epsilon)$ such that, with $\epsilon$ drawn from a fixed (non-parameterized) distribution $p(\epsilon)$, we have $z \sim q(z; \phi)$, so

$$\mathbb{E}_{z \sim q(z;\phi)}\left[f(z)\right] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[f(g_\phi(\epsilon))\right]$$

For the Normal distribution used here ($q_\phi(z|x)$), it is convenient to use the location-scale transformation $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
$$\tilde{\mathcal{L}}(\theta, \phi, x_i) = \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x_i | z_{i,l}) - D_{KL}\left[q_\phi(z|x_i) \,\Vert\, p(z)\right] \tag{3}$$

$$z_{i,l} = \mu_{x_i} + \sigma_{x_i} \odot \epsilon_{i,l} \quad \text{with} \quad \epsilon_{i,l} \sim \mathcal{N}(0, I)$$

For a total of $N$ data points with mini-batch size $M$:

$$\mathcal{L}(\theta, \phi; X) = \sum_{i=1}^{N} \mathcal{L}(\theta, \phi, x_i) \approx \tilde{\mathcal{L}}^M(\theta, \phi; X) = \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi, x_i) \tag{4}$$

For sufficiently large batch size $M$, the inner sample size $L$ can be set to 1. Because both the mini-batch gradient descent and the expectation estimation are stochastic, this is also called doubly stochastic estimation.
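To make (3) and (4) concrete, below is a minimal PyTorch sketch of the single-sample ($L = 1$) estimator. The `encoder`/`decoder` modules are illustrative assumptions, and a Bernoulli decoder is used for brevity instead of the Gaussian decoder in the setup above.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, encoder, decoder):
    """Single-sample (L = 1) estimate of -L(theta, phi, x) from eq. (3).

    Assumptions (illustrative, not from the paper): encoder(x) returns
    (mu, log_var) of q_phi(z|x); decoder(z) returns Bernoulli logits of
    p_theta(x|z), a common simplification of the Gaussian decoder.
    """
    mu, log_var = encoder(x)

    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so the sample stays differentiable w.r.t. phi = (mu, log_var).
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps

    # Reconstruction term: single-sample estimate of E_q[log p_theta(x|z)].
    log_px_z = -F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="sum")

    # KL[q_phi(z|x) || N(0, I)], available in closed form for Gaussians.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return -(log_px_z - kl)   # minimize this to maximize the ELBO
```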
Using non-Gaussian encoders
Todo: discuss AVB paper
Gumbel trick for discrete latent variables
References for this section:
- Gumbel-max trick https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/
- Balog, Matej, et al. “Lost Relatives of the Gumbel Trick.” arXiv preprint arXiv:1706.04161 (2017).
- Jang, Eric, Shixiang Gu, and Ben Poole. “Categorical reparameterization with gumbel-softmax.” arXiv preprint arXiv:1611.01144 (2016).
Gumbel distribution: a standard Gumbel random variable $g$ has CDF $F(g) = \exp(-\exp(-g))$ and can be sampled as $g = -\log(-\log(u))$ with $u \sim \mathrm{Uniform}(0, 1)$. The Gumbel-max trick states that $\arg\max_k (\log \pi_k + g_k)$, with i.i.d. standard Gumbel $g_k$, is an exact sample from the categorical distribution with probabilities $\pi$; Gumbel-softmax (Jang et al.) relaxes the $\arg\max$ to a temperature-controlled softmax so that the sample becomes differentiable, as sketched below.
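A minimal sketch of Gumbel-softmax sampling, following Jang et al. (the function name and temperature default are my own):

```python
import torch

def gumbel_softmax_sample(logits, temperature=1.0):
    """Differentiable approximate sample from Categorical(softmax(logits)).

    Gumbel-max: argmax_k(logits_k + g_k) with i.i.d. g_k ~ Gumbel(0, 1)
    is an exact categorical sample; replacing argmax with a softmax at
    a temperature (Jang et al.) keeps it differentiable, approaching a
    one-hot vector as temperature -> 0.
    """
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    return torch.softmax((logits + g) / temperature, dim=-1)
```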
f-GAN and GAN
Prelude on f-divergence and its variational lower bound
The f-divergence family:

$$D_f(P \,\Vert\, Q) = \int_\mathcal{X} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx \tag{5}$$

where the generator function $f: \mathbb{R}_+ \to \mathbb{R}$ is a convex, lower-semicontinuous function satisfying $f(1) = 0$.
Every convex, lower-semicontinuous function $f$ has a convex conjugate $f^c$, also known as the Fenchel conjugate, defined as

$$f^c(t) = \sup_{u \in \mathrm{dom}_f} \{ut - f(u)\} \tag{6}$$

The function $f^c$ is again convex and lower-semicontinuous, and the pair $(f, f^c)$ is dual, i.e. $(f^c)^c = f$. So we can represent $f$ as
$$f(u) = \sup_{t \in \mathrm{dom}_{f^c}} \{tu - f^c(t)\} \tag{7}$$

With this we can establish a general lower bound for estimating the f-divergence:

$$\begin{aligned}
D_f(P \,\Vert\, Q) &= \int_\mathcal{X} q(x) \sup_{t \in \mathrm{dom}_{f^c}} \left\{ t \frac{p(x)}{q(x)} - f^c(t) \right\} dx \\
&\geq \sup_{T \in \mathcal{T}} \int_\mathcal{X} \left( p(x)\, T(x) - q(x)\, f^c(T(x)) \right) dx \\
&= \sup_{T \in \mathcal{T}} \left( \mathbb{E}_{x \sim P}\left[T(x)\right] - \mathbb{E}_{x \sim Q}\left[f^c(T(x))\right] \right)
\end{aligned} \tag{8}$$

where $\mathcal{T}$ is an arbitrary class of functions $T: \mathcal{X} \to \mathbb{R}$. The inequality is due to Jensen's inequality and the restrictions imposed by $\mathcal{T}$.
The bound is tight for

$$T^*(x) = f'\!\left(\frac{p(x)}{q(x)}\right)$$
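As a quick numerical sanity check (my own illustration, not from the papers), evaluating the bound in (8) at $T^*$ with Monte Carlo samples recovers the analytic forward KL between two Gaussians:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, q = norm(0.0, 1.0), norm(1.0, 1.5)   # P and Q, both easy to sample

# Forward KL: f(u) = u log u, so T*(x) = f'(p/q) = 1 + log(p(x)/q(x))
# and f^c(t) = exp(t - 1).
T_star = lambda x: 1.0 + p.logpdf(x) - q.logpdf(x)
f_conj = lambda t: np.exp(t - 1.0)

xp = p.rvs(10**6, random_state=rng)
xq = q.rvs(10**6, random_state=rng)
bound = T_star(xp).mean() - f_conj(T_star(xq)).mean()   # eq. (8) at T*

# Closed-form KL(P || Q) for univariate Gaussians, for comparison.
kl = np.log(1.5) + (1.0 + 1.0) / (2 * 1.5**2) - 0.5
print(f"bound at T*: {bound:.4f}  vs  analytic KL: {kl:.4f}")
```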
Generative adversarial training
Suppose our goal is to come up with a distribution $Q$ (the model) that is close to $P$ (the data distribution), where closeness is measured by $D_f(P \,\Vert\, Q)$. However, directly computing (5) is intractable, e.g. when the functional form of $P$ is unknown and $Q$ is a complex model parameterized by a neural network.
To be specific:
- Evaluating $q(x)$ at any $x$ is easy, but integrating it is hard because it lacks a convenient functional form.
- We do not know how to evaluate $p(x)$ at any $x$.
- Sampling from both $P$ and $Q$ is easy: drawing from the data set approximates $x \sim P$, and the model $Q$ can take random vectors as input, which are easy to produce.
Since we can sample from both distributions easily, (8) offers a way to estimate a lower bound of the divergence. We maximize this lower bound over $T$ so that it approaches the true divergence, then minimize it over $Q$. This is formally stated as follows.
$$F(\theta, \omega) = \mathbb{E}_{x \sim P}\left[T_\omega(x)\right] + \mathbb{E}_{x \sim Q_\theta}\left[-f^c(T_\omega(x))\right]$$

$$\hat\theta = \arg\min_\theta \max_\omega F(\theta, \omega)$$

To ensure that the output of $T_\omega$ respects the domain of $f^c$, we define $T_\omega(x) = g_f(V_\omega(x))$, where $V_\omega: \mathcal{X} \to \mathbb{R}$ has no range constraint on its output and $g_f: \mathbb{R} \to \mathrm{dom}_{f^c}$ is an output activation function specific to the f-divergence used, with a suitable output range.
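A minimal sketch of the resulting saddle-point objective, instantiated for the forward-KL divergence ($g_f(\nu) = \nu$, $f^c(t) = \exp(t - 1)$; see the tables below). The module names and batch handling are illustrative assumptions:

```python
import torch

# Forward-KL instance of f-GAN (see the tables below):
# g_f(v) = v (identity), f^c(t) = exp(t - 1).
g_f = lambda v: v
f_conj = lambda t: torch.exp(t - 1.0)

def fgan_objective(V, G, x_real, noise):
    """F(theta, omega) = E_P[T(x)] - E_Q[f^c(T(x))] with T = g_f o V.

    V (critic) and G (generator) are assumed to be torch.nn.Modules.
    The critic ascends F; the generator descends F (in practice via
    two separate optimizers and fresh forward passes).
    """
    T_real = g_f(V(x_real))
    T_fake = g_f(V(G(noise)))
    return T_real.mean() - f_conj(T_fake).mean()
```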
GAN
For the original GAN, the divergence target is similar to Jensen-Shannon:

$$F(\theta, \omega) = \mathbb{E}_{x \sim P}\left[\log D_\omega(x)\right] + \mathbb{E}_{x \sim Q_\theta}\left[\log(1 - D_\omega(x))\right]$$

with $D_\omega(x) = 1/(1 + e^{-V_\omega(x)})$, which corresponds to

$$\begin{aligned}
g_f(\nu) &= \log\left(1/(1 + e^{-\nu})\right) \\
T_\omega(x) &= \log D_\omega(x) = g_f(V_\omega(x)) \\
f^c(t) &= -\log(1 - \exp(t)) \\
\log(1 - D_\omega(x)) &= -f^c(T_\omega(x))
\end{aligned}$$
Practical considerations in adversarial training
Todo: log trick, DCGAN heuristics
Example divergences and their related functions
| Name | $D_f(P \,\Vert\, Q)$ | Generator $f(u)$ | $T^*(x)$ |
|---|---|---|---|
| Forward KL | $\int p(x) \log \frac{p(x)}{q(x)}\, dx$ | $u \log u$ | $1 + \log \frac{p(x)}{q(x)}$ |
| Reverse KL | $\int q(x) \log \frac{q(x)}{p(x)}\, dx$ | $-\log u$ | $-\frac{q(x)}{p(x)}$ |
| Jensen-Shannon | $\frac{1}{2} \int p(x) \log \frac{2 p(x)}{p(x) + q(x)} + q(x) \log \frac{2 q(x)}{p(x) + q(x)}\, dx$ | $u \log u - (u + 1) \log \frac{u + 1}{2}$ | $\log \frac{2 p(x)}{p(x) + q(x)}$ |
| GAN | $\int p(x) \log \frac{2 p(x)}{p(x) + q(x)} + q(x) \log \frac{2 q(x)}{p(x) + q(x)}\, dx - \log 4$ | $u \log u - (u + 1) \log(u + 1)$ | $\log \frac{p(x)}{p(x) + q(x)}$ |
| Name | Conjugate $f^c(t)$ | $\mathrm{dom}_{f^c}$ | Output activation $g_f$ | $f'(1)$ |
|---|---|---|---|---|
| Forward KL | $\exp(t - 1)$ | $\mathbb{R}$ | $\nu$ | $1$ |
| Reverse KL | $-1 - \log(-t)$ | $\mathbb{R}_-$ | $-\exp(\nu)$ | $-1$ |
| Jensen-Shannon | $-\log(2 - \exp(t))$ | $t < \log 2$ | $\log 2 - \log(1 + \exp(-\nu))$ | $0$ |
| GAN | $-\log(1 - \exp(t))$ | $\mathbb{R}_-$ | $-\log(1 + \exp(-\nu))$ | $-\log 2$ |
WGAN and WAE
Optimal transport (OT)
Kantorovich formulated the optimization target in optimal transport problems as follows:

$$W_c(P_X, P_G) = \inf_{\Gamma \in \mathcal{P}(x \sim P_X,\, y \sim P_G)} \mathbb{E}_{(x, y) \sim \Gamma}\left[c(x, y)\right]$$

where $\mathcal{P}(X \sim P_X, Y \sim P_G)$ is the set of all joint distributions of $(X, Y)$ with marginals $P_X$ and $P_G$.
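For finite discrete distributions, the infimum over couplings is a small linear program; here is an illustrative sketch (the support points and weights are made up):

```python
import numpy as np
from scipy.optimize import linprog

# Toy discrete OT: P_X on two points, P_G on three (all values made up).
p = np.array([0.5, 0.5])                  # marginal of x
q = np.array([0.3, 0.3, 0.4])             # marginal of y
xs, ys = np.array([0.0, 1.0]), np.array([0.0, 2.0, 3.0])
C = np.abs(xs[:, None] - ys[None, :])     # cost c(x, y) = |x - y|

# Variables: coupling Gamma (2x3), flattened row-major. Equality
# constraints pin its row sums to p and column sums to q.
A_eq, b_eq = [], []
for i in range(2):
    a = np.zeros((2, 3)); a[i, :] = 1.0
    A_eq.append(a.ravel()); b_eq.append(p[i])
for j in range(3):
    a = np.zeros((2, 3)); a[:, j] = 1.0
    A_eq.append(a.ravel()); b_eq.append(q[j])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
print("W_c(P_X, P_G) =", res.fun)         # inf over couplings of E[c]
```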
Wasserstein distance
When $c(x, y) = \|x - y\|^p$ for $p \geq 1$, $W_c^{1/p}$ is called the p-Wasserstein distance:

$$W_p(P_X, P_G) = \left( \inf_{\Gamma \in \mathcal{P}(x \sim P_X,\, y \sim P_G)} \mathbb{E}_{(x, y) \sim \Gamma}\left[\|x - y\|^p\right] \right)^{1/p}$$

The optimization problem is highly intractable in general due to the marginal constraints. However, when $p = 1$, the Kantorovich-Rubinstein duality holds:
$$W_1(P_X, P_G) = \sup_{f \in \{1\text{-Lipschitz}\}} \mathbb{E}_{x \sim P_X}\left[f(x)\right] - \mathbb{E}_{y \sim P_G}\left[f(y)\right]$$

The f-divergence family only considers relative probability (the ratio between the two probability density functions) and does not measure the closeness of the underlying outcomes. With disjoint supports, or overlapping supports whose intersection has zero measure, the divergence between a target distribution and a $\theta$-parameterized distribution may not be continuous with respect to $\theta$. For example, with $P = \delta_0$ and $Q_\theta = \delta_\theta$, the JS divergence is the constant $\log 2$ for all $\theta \neq 0$, whereas $W_1(P, Q_\theta) = |\theta|$. The Wasserstein distance, on the other hand, takes into account the underlying topology of the outcomes; it is continuous and differentiable almost everywhere with respect to $\theta$ and thus almost always provides a useful gradient for optimization.
Wasserstein GAN (WGAN)
Following the dual form of $W_1$, we can form a generative-adversarial model for a data distribution $P_D$ and model $Q_\theta$, with an auxiliary function $f$ that is 1-Lipschitz continuous:

$$\hat\theta = \arg\min_\theta \sup_{f \in \{1\text{-Lipschitz}\}} \mathbb{E}_{x \sim P_D}\left[f(x)\right] - \mathbb{E}_{x \sim Q_\theta}\left[f(x)\right]$$

Practical considerations for WGAN
Todo: weight clipping to enforce a K-Lipschitz constraint on $f$; soft gradient penalty (WGAN-GP)
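A minimal sketch of one critic update with the original paper's weight clipping (module names and optimizer are illustrative; the paper used a clip value of 0.01):

```python
import torch

def critic_step(f, G, x_real, noise, opt_f, clip=0.01):
    """One WGAN critic update: ascend E_P[f(x)] - E_Q[f(G(z))].

    f (critic) and G (generator) are assumed torch.nn.Modules; opt_f is
    the critic's optimizer. Weight clipping to [-clip, clip] is the
    original paper's crude way of keeping f approximately K-Lipschitz;
    WGAN-GP replaces it with a soft gradient penalty.
    """
    opt_f.zero_grad()
    loss = -(f(x_real).mean() - f(G(noise).detach()).mean())
    loss.backward()
    opt_f.step()
    with torch.no_grad():
        for w in f.parameters():
            w.clamp_(-clip, clip)   # enforce the Lipschitz proxy
```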
Wasserstein Auto-encoder (WAE)
Rather than working with the dual form of the Wasserstein distance, which only holds for $W_1$, we can also work with the primal form directly. As shown in Tolstikhin et al., "Wasserstein Auto-Encoders," the following holds when we have a deterministic decoder mapping the latent variable $Z$ to $Y$ through $y = G(z)$:

$$W_c(P_X, P_G) = W_c^\dagger(P_X, P_G) = \inf_{\Gamma \in \mathcal{P}(x \sim P_X,\, z \sim P_Z)} \mathbb{E}_{(x, z) \sim \Gamma}\left[c(x, G(z))\right] = \inf_{Q:\, Q_Z = P_Z} \mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim Q(Z|X)}\left[c(x, G(z))\right]$$

The constraint on $Q(Z|X)$ is that its marginal $Q_Z$ must equal $P_Z$. To obtain a feasible optimization problem, we relax this constraint into a constraint-free objective with a penalty that assesses the closeness of $Q_Z$ and $P_Z$ via any reasonable divergence. This new objective is named penalized optimal transport (POT):
$$D_{POT/WAE}(P_X, P_G) := \inf_{Q \in \mathcal{Q}} \mathbb{E}_{x \sim P_X} \mathbb{E}_{z \sim Q(Z|X)}\left[c(x, G(z))\right] + \lambda \cdot D_Z(Q_Z, P_Z)$$

If the divergence between $P_Z$ and $Q_Z$ is intractable to compute directly, we can use generative-adversarial training to approximate it (see f-GAN); a concrete MMD-based instance is sketched after the notes below.
Note: if the decoder is probabilistic instead of deterministic, we only have $W_c(P_X, P_G) \leq W_c^\dagger(P_X, P_G)$, so we are minimizing an upper bound of the true OT cost.
Thought: the original paper used the JS divergence for $D_Z$; how about using the Wasserstein distance for $D_Z$?
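For a concrete instance of the penalty, the paper's WAE-MMD variant uses the maximum mean discrepancy for $D_Z$. A rough sketch (the RBF kernel, squared-error cost, deterministic encoder, and $\lambda$ are illustrative choices; the paper favors an inverse multiquadratic kernel):

```python
import torch

def mmd_penalty(z_q, z_p, scale=1.0):
    """Estimate of MMD^2 between encoded samples z_q ~ Q_Z and prior
    samples z_p ~ P_Z, with an RBF kernel (diagonal terms excluded)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * scale**2))
    n, m = z_q.size(0), z_p.size(0)
    return ((k(z_q, z_q).sum() - n) / (n * (n - 1))
            + (k(z_p, z_p).sum() - m) / (m * (m - 1))
            - 2 * k(z_q, z_p).mean())

def wae_loss(x, encoder, decoder, lam=10.0):
    """D_POT/WAE with a deterministic encoder for Q(Z|X) and squared
    error cost c(x, y) = ||x - y||^2 (both illustrative choices)."""
    z = encoder(x)                     # z ~ Q_Z via x ~ P_X
    z_prior = torch.randn_like(z)      # z ~ P_Z = N(0, I)
    recon = (x - decoder(z)).pow(2).sum(dim=1).mean()
    return recon + lam * mmd_penalty(z, z_prior)
```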
Todo: discuss connections to AAE paper