Brief Survey of Generative Models

  8 minute read

OverviewPermalink

This page is a high-level summary of various generative models with little explanations. Models to cover are as follows:

Variational Autoencoders (VAE)

  • Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

Adversarial Variational Bayes (AVB)

Extention to VAE to use non-Gaussian encoders

  • Mescheder, Lars, Sebastian Nowozin, and Andreas Geiger. “Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks.” arXiv preprint arXiv:1701.04722 (2017).

Generative Adverserial Networks (GAN)

  • Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

Generalized divergence minimization GAN (ff-GAN)

  • Nowozin, Sebastian, Botond Cseke, and Ryota Tomioka. “f-gan: Training generative neural samplers using variational divergence minimization.” Advances in Neural Information Processing Systems. 2016.

Wasserstein GAN (WGAN)

  • Arjovsky, Martin, Soumith Chintala, and Léon Bottou. “Wasserstein gan.” arXiv preprint arXiv:1701.07875 (2017).

Adversarial Autoencoders (AAE)

  • Makhzani, Alireza, et al. “Adversarial autoencoders.” arXiv preprint arXiv:1511.05644 (2015).

Wasserstein Auto-Encoder (WAE)

  • Tolstikhin, Ilya, et al. “Wasserstein Auto-Encoders.” arXiv preprint arXiv:1711.01558 (2017).

Cramer GAN

  • Bellemare, Marc G., et al. “The Cramer Distance as a Solution to Biased Wasserstein Gradients.” arXiv preprint arXiv:1705.10743 (2017).

VAEPermalink

Model setup:Permalink

  • Recognition model: qϕ(z|x)=N(μ=h1(x),σ2I=h2(x)I)qϕ(z|x)=N(μ=h1(x),σ2I=h2(x)I)
  • Assumed fixed prior: p(z)=N(0,I)p(z)=N(0,I)
  • Generation model: pθ(x|z)=N(μ=g1(z),σ2I=g2(z)I)pθ(x|z)=N(μ=g1(z),σ2I=g2(z)I)
    • Implied (but intractable) posterior: pθ(z|x)pθ(z|x)

Key equations:Permalink

logpθ(xi)=DKL(qϕ(z|xi)pθ(z|xi)+L(θ,ϕ,xi)logpθ(xi)=DKL(qϕ(z|xi)pθ(z|xi)+L(θ,ϕ,xi)(1) L(θ,ϕ,xi)=Ezqϕ(z|xi)[logpθ(xi,z)logqϕ(z|xi)]=Ezqϕ(z|xi)[logpθ(xi,z)]+H[qϕ(z|xi)]=Ezqϕ(z|xi)[logpθ(xi|z)]DKL[qϕ(z|xi)p(z)]L(θ,ϕ,xi)=Ezqϕ(z|xi)[logpθ(xi,z)logqϕ(z|xi)]=Ezqϕ(z|xi)[logpθ(xi,z)]+H[qϕ(z|xi)]=Ezqϕ(z|xi)[logpθ(xi|z)]DKL[qϕ(z|xi)p(z)](2)

Optimization objective:Permalink

ˆθ,ˆϕ=argmaxθ,ϕiL(θ,ϕ,xi)^θ,^ϕ=argmaxθ,ϕiL(θ,ϕ,xi)

Gradient-friendly Monte Carlo:Permalink

Difficulties in calculating L(θ,ϕ,xi)L(θ,ϕ,xi):

  • Due to the generality of qq and pp (typically a neural network), the expectation in 22 does not have an analytical form. So we need to resort to Monte Carlo estimation.
  • Furthermore, direct sampling zz according to qq poses difficulty in taking derivative against parameters ϕϕ that parameterizes the distribution qq.

Solution: Reparameterization Trick

Find smooth and invertible transformation z=gϕ(ϵ)z=gϕ(ϵ) such that with ϵϵ drawn from a fixed (non-parameterized) distribution p(ϵ)p(ϵ) we have zq(z;ϕ)zq(z;ϕ), so

Ezq(z;ϕ)[f(z)]=Eϵp(ϵ)[f(gϕ(ϵ))]Ezq(z;ϕ)[f(z)]=Eϵp(ϵ)[f(gϕ(ϵ))]

For the Normal distribution used here (qϕ(z|x)qϕ(z|x)), it is convenient to use location-scale transformation, z=μ+σϵz=μ+σϵ with ϵN(0,I)ϵN(0,I).

˜L(θ,ϕ,xi)=1LLl=1(logpθ(xi|zi,l)]DKL[qϕ(zi,l|xi)p(zi,l))˜L(θ,ϕ,xi)=1LLl=1(logpθ(xi|zi,l)]DKL[qϕ(zi,l|xi)p(zi,l))(3) zi,l=μxi+σxiϵi,l  and  ϵlN(0,I)zi,l=μxi+σxiϵi,l  and  ϵlN(0,I)

For total NN data points with mini batch size MM:

L(θ,ϕ;X)=Ni=1L(θ,ϕ,xi)~LM(θ,ϕ;X)=NMMi=1˜L(θ,ϕ,xi)L(θ,ϕ;X)=Ni=1L(θ,ϕ,xi)˜LM(θ,ϕ;X)=NMMi=1˜L(θ,ϕ,xi)(4)

For sufficiently large batch size MM, the inner loop sample size LL can be set to 1. Due to stochastic mini batch gradient descent and stochastic expectation estimation, this is also called doubly stochastic estimation.

Using non-Gaussian encodersPermalink

Todo: discuss AVB paper

Gumble trick for discrete latent variablesPermalink

Ref for this section:

  1. Gumble max trick https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/
  2. Balog, Matej, et al. “Lost Relatives of the Gumbel Trick.” arXiv preprint arXiv:1706.04161 (2017).
  3. Jang, Eric, Shixiang Gu, and Ben Poole. “Categorical reparameterization with gumbel-softmax.” arXiv preprint arXiv:1611.01144 (2016).

Gumble distribution:


ff-GAN and GANPermalink

Prelude on ff-divergence and its variational lower boundPermalink

The f-divergence family

Df=Xq(x) f(p(x)q(x))dxDf=Xq(x) f(p(x)q(x))dx(5)

where the generator function f:R+Rf:R+R is a convex, lower-semicontinuous function satisfying f(1)=0f(1)=0.

Every convex, lower-semicontinuous function has a convex conjugate function fcfc, also known as Fenchel conjugate. This function is defined as

fc(t)=supudomf{utf(u)}fc(t)=supudomf{utf(u)}(6)

Function fcfc is again convex and lower-semicontinuous and the pair (f,fc)(f,fc) is dual to each other, i.e. (fc)c=f(fc)c=f. So we can represent ff as

f(t)=suptdomfc{tufc(t)}f(t)=suptdomfc{tufc(t)}(7)

With this we can establish a lower bound for estimating the f-divergence in general

Df(PQ)=Xq(x)suptdomfc{tp(x)q(x)fc(t)}dxsupTTX(p(x)T(x)q(x)fc(T(x)))dx=supTT(ExP[T(x)]ExQ[fc(T(x))])Df(PQ)=Xq(x)suptdomfc{tp(x)q(x)fc(t)}dxsupTTX(p(x)T(x)q(x)fc(T(x)))dx=supTT(ExP[T(x)]ExQ[fc(T(x))])(8)

where T is an arbitrary class of functions T:XR. The inequality is due to Jessen’s inequality and constraints imposed by T.

The bond is tight for

T(x)=f(p(x)q(x))

Generative adversarial trainingPermalink

Suppose our goal is to come up with a distribution Q (model) that is close to P (the data distribution) and the similarity score (loss) is measured by Df(PQ). However the direct calculation of 5 is intractable, such as the case where the functional form of P is unknown and Q is a complex model parameterized by a neural network.

To be specific:

  • Evaluating q(x) at any x is easy, but integrating it is hard due to lack of easy functional form.
  • For p(x), we do not know how to evaluate it at any x
  • Sampling from both P and Q are easy. Because drawing from data set approximates xP and we can make the model Q take random vectors as input which are easy to produce.

Fortunately, we can sample from both of them easily. In this case, 8 offers a way to estimate the lower bound of the divergence. We would need to maximize this lower bound by changing T so that it is close to the true divergence, then minimize it over Q. This is formally stated as follows.

F(θ,ω)=ExP[Tω(x)]+ExQθ[fc(Tω(x))] ˆθ=argminθmaxωF(θ,ω)

To ensure that the output of Tω respects the domain of fc, we define Tω(x)=gf(Vω(x)), where Vω:XR without any range constraints on the output and gf:Rdomfc is an output activation function specific to the f-divergence used with suitable output ranges.

GANPermalink

For the original GAN, with a divergence target similar to Jensen-Shannon F(θ,ω)=ExP[logDω(x)]+ExQθ[log(1Dω(x))] with Dω=1/(1+eVω(x)) which corresponds to the following

gf(ν)=log(1/(1+eν)) Tω(x)=log(Dω(x))=gf(Vω(x)) fc(t)=log(1exp(t)) log(1Dω(x))=fc(Tω(x))

Practical considerations in adversarial trainingPermalink

Todo: log trick, DCGAN heuristics

Name Df(P|Q) Generator f(u) T(x)
Forward KL p(x)logp(x)q(x)dx ulogu 1+logq(x)p(x)
Reverse KL q(x)logp(x)q(x)dx logu q(x)p(x)
Jensen-Shannon 12p(x)log2p(x)p(x)+q(x)+q(x)log2q(x)p(x)+q(x)dx ulogu(u+1)logu+12 log2p(x)p(x)+q(x)
GAN p(x)log2p(x)p(x)+q(x)+q(x)log2q(x)p(x)+q(x)dxlog(4) ulogu(u+1)log(u+1) logp(x)p(x)+q(x)
Name Conjugate fc(t) domfc Output activation gf f(1)
Forward KL exp(t1) R ν 1
Reverse KL 1log(t) R exp(ν) 1
Jensen-Shannon log(2exp(t)) t<log(2) log(2)log(1+exp(ν)) 0
GAN log(1exp(t)) R log(1+exp(ν)) log(2)

WGAN and WAEPermalink

Optimal transport (OT)Permalink

Kantorovich formulated the optimization target in optimal transport problems as follows

Wc(PX,PG)=infΓP(xPX,yPY)Ex,yΓ[c(x,y)]

where P(XPX,YPY) is a set of all join distributions of (X,Y) with marginals PX and PY.

Wasserstein distancePermalink

When c(x,y)=xyp for p1, W1/pc is called p-Wasserstein distance.

Wp(PX,PG)=infΓP(xPX,yPY)Ex,yΓ[xyp]

The optimization problem is highly intractable in general, due to the constraint. However when p=1, Kantorovich-Rubinstein duality holds:

W1(PX,PG)=supf{1-Lipschitz}ExPX[f(x)]EyPY[f(y)]

The family of divergences from f-divergence only consider the relative probability (the ratio between two probability density functions) and do not measure the closeness of the underlying outcomes. With disjoint support or overlapping support but intersections that yield zero measure, the divergence between a target distribution and a θ-parameterized distribution might not be continuous with respect to θ. Wasserstein distance on the other hand does take into account the underlying topology of the outcomes and is continuous and differentiable almost everywhere with respect to θ and thus almost always provide useful gradient for optimization.

Wasserstein GAN (WGAN)Permalink

Following the dual form of W1, we can form a generative-adversarial model for a data distribution PD and model Qθ with auxiliary function f that is 1-Lipschitz continuous.

ˆθ=argminθsupf{1-Lipschitz}ExPD[f(x)]ExQθ[f(x)]

$$ Practical considerations for WGAN

Todo: Gradient clipping with K-Lipschitz constraint on f; Soft gradient penalty (WGAN-GP)

Wasserstein Auto-encoder (WAE)Permalink

Rather than working with the dual or Wasserstein distance, which only holds for W1, we can also work with the primal form directly. As shown in Tolstikhin, Ilya, et al. “Wasserstein Auto-Encoders.” the following holds when we have a deterministic decoder mapping latent variable Z to Y through y=G(z):

Wc(PX,PG)=Wc(PX,PG)=infPP(xPX,zPZ)Ex,yP[c(x,G(z)]=infQ:QZ=PZExPXEzQ(Z|X)[c(x,G(z)]

The constraint put on Q(Z|X) is that its marginal needs to equal to P(Z). To have a feasible optimization problem we relax this constraint with the following constraint-free optimization target with a penalty that assess the closeness between Q(Z) and P(Z) via any reasonable divergence. This new objective is named penalized optimal transport (POT).

DPOT/WAE(PX,PG):=infQQExPXEzQ(Z|X)[c(x,G(z)]+λDZ(QZ,PZ)

If the divergence between PZ and QZ is intractable to directly calculate, we could use generative-adversarial training to approximate it (see f-GAN).

Note: if decoder is probabilistic instead of deterministic, we would only have Wc(PX,PG)Wc(PX,PG), so we are minimizing an upper bound of the true OT cost.

Thought: the original paper used JS divergence for DZ, how about we use Wasserstein distance for DZ.

Todo: discuss connections to AAE paper

Leave a comment