# Learning Multi-layer Latent Variable Model via Variational Optimization of Short Run MCMC for Approximate Inference

^{*}Equal contributions

University of California, Los Angeles (UCLA), USA

Stevens Institute of Technology, USA

## Abstract

This paper studies the fundamental problem of learning deep generative models that consist of multiple layers of latent variables organized in top-down architectures. Such models have high expressivity and allow for learning hierarchical representations. Learning such a generative model requires inferring the latent variables for each training example based on the posterior distribution of these latent variables. The inference typically requires Markov chain Monte Caro (MCMC) that can be time consuming. In this paper, we propose to use noise initialized non-persistent short run MCMC, such as finite step Langevin dynamics initialized from the prior distribution of the latent variables, as an approximate inference engine, where the step size of the Langevin dynamics is variationally optimized by minimizing the Kullback-Leibler divergence between the distribution produced by the short run MCMC and the posterior distribution. Our experiments show that the proposed method outperforms variational auto-encoder (VAE) in terms of reconstruction error and synthesis quality. The advantage of the proposed method is that it is simple and automatic without the need to design an inference model.

## Paper

The publication can be obtained here.@article{nijkamp2020inf, title={Learning Multi-layer Latent Variable Model via Variational Optimization of Short Run MCMC for Approximate Inference}, author={Nijkamp, Erik and Pang, Bo and Han, Tian and Zhu, Song-Chun and Wu, Ying Nian}, journal={ECCV}, year={2020} }

## Contributions

(1) We propose short run MCMC for approximate inference of latent variables in deep generative models.

(2) We provide a method to determine the optimal step size, or in general, hyper-parameters of the short run MCMC.

(3) We demonstrate learning of multi-layer latent variable models with high quality samples and reconstructions.

## Code

The code can be obtained here. (Coming soon)

## Experiments

### Experiment 1: Synthesis

We contrast the synthesis of a generator model \(g_\theta(z)\) with \(L=5\) latent layers learned by Ladder-VAE and our proposed short run MCMC infernece method in Figure 1. Despite its simplicity, short run MCMC is competitive with elaborate means of inference in VAE models and flow-based models, such as Glow.

### Experiment 2: Reconstruction

We evaluate the accuracy of the learned short run MCMC inference dynamics \(q_{s, \theta_t}(z|x_i)\) by reconstructing test images. In contrast to traditional MCMC posterior sampling with persistent chains, short run inference with small \(K\) allows not only for efficient learning on training examples, but also the same dynamics can be recruited for testing examples. Figure 2 compares the reconstructions of learned generators with \(L=5\) layers by Ladder-VAE and short run inference on CelebA \((64\times64\times3)\). The fidelity of reconstructions by short run MCMC inference appears qualitatively improved over VAE

### Experiment 3: Inpainting

Our method can "inpaint" occluded image regions. To recover the occluded pixels, the only required modification involves the computation of \(\|x - g_\theta(z)\|^2/\sigma^2\). For a fully observed image, the term is computed by the summation over all pixels. For partially observed images, we only compute the summation over the observed pixels. Figure 3 depicts test images taken from the CelebA dataset for which a mask randomly occludes pixels in various patterns.

### Experiment 4: Hierarchical representation

Multi-layer latent variable models not only demonstrate improved expressiveness over single-layer ones but also allow for learning the hierarchical structure. It can be argued that an alternative parameterization of the multi-layer generator promotes disentangled hierarchical features. We train a three-layer model with this parameterization using short run inference on SVHN. As shown in Figure 4, the three-layer latent variables capture disentangled representations, which are background color, digit identity, general structure from bottom to top layer.

### Experiment 5: Variational optimization of step size

The step size \(s\) of short run MCMC may be optimized such that \(q_s(z)\) best approximates the posterior \(p_\theta(z|x)\). That is, we can optimize the step size \(s\) by minimizing \(KL(q_s(z) \| p_\theta(z|x))\) via a grid search or gradient descent. Then, we evaluate the eigenvalues of \(d F(z_0, \epsilon)/d z_0\). As both steps are computed in a differentiable manner, we may compute the gradient with respect to \(s\). Figure 5 depict the optimal step size \(s\) over learning iterations \(t\) determined by grid-search and gradient descent.

## Acknowledgements

The work is supported by NSF DMS-2015577, DARPA XAI N66001-17-2-4029, ARO W911NF1810296, ONR MURI N00014-16-1-2007, and XSEDE grant ASC170063. We thank NVIDIA for the donation of Titan V GPUs. We thank Eric Fischer for the assistance with experiments.