Vision Model: Variational Autoencoder (VAE)

Learn how VAEs compress visual observations into compact latent representations.

Overview

The Vision Model in World Models uses a Variational Autoencoder (VAE) to compress high-dimensional visual observations into a compact latent space.

Why Use a VAE?

Raw visual observations are:

  • High-dimensional: Computationally expensive to process
  • Redundant: Many pixels contain similar information
  • Noisy: Not all visual details are relevant for decision-making

A VAE addresses these issues by learning to do three things, sketched in code after this list:

  1. Encode observations into a low-dimensional latent space
  2. Decode latent vectors back to reconstructed observations
  3. Regularize the latent space for smooth interpolation
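
A minimal sketch of such a model, assuming PyTorch, 64×64 RGB frames, and a 32-dimensional latent vector; the class name ConvVAE and the exact layer sizes are illustrative choices rather than a definitive implementation:

    import torch
    import torch.nn as nn

    class ConvVAE(nn.Module):
        """Minimal convolutional VAE: encode, sample, decode."""

        def __init__(self, latent_dim=32):
            super().__init__()
            # Encoder: 3x64x64 frame -> flat feature vector.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
                nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),
                nn.Flatten(),                                # -> 256 * 2 * 2 = 1024 features
            )
            self.fc_mu = nn.Linear(1024, latent_dim)         # mean of q(z|x)
            self.fc_logvar = nn.Linear(1024, latent_dim)     # log-variance of q(z|x)
            # Decoder: latent vector -> reconstructed 3x64x64 frame.
            self.fc_dec = nn.Linear(latent_dim, 1024)
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 6, stride=2), nn.Sigmoid(),  # pixels in [0, 1]
            )

        def encode(self, x):
            h = self.encoder(x)
            return self.fc_mu(h), self.fc_logvar(h)

        def reparameterize(self, mu, logvar):
            # Explained in "The Reparameterization Trick" below.
            std = torch.exp(0.5 * logvar)
            eps = torch.randn_like(std)
            return mu + eps * std

        def decode(self, z):
            h = self.fc_dec(z).view(-1, 1024, 1, 1)
            return self.decoder(h)

        def forward(self, x):
            mu, logvar = self.encode(x)
            z = self.reparameterize(mu, logvar)   # sample a latent code
            return self.decode(z), mu, logvar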

The Reparameterization Trick

A key innovation in VAEs is the reparameterization trick, which enables backpropagation through the stochastic sampling step. Instead of sampling z directly from N(μ, σ²), the encoder outputs μ and σ, an auxiliary noise variable ε is drawn from N(0, I), and the latent code is computed as z = μ + σ ⊙ ε. Because the randomness is isolated in ε, gradients can flow through μ and σ back to the encoder's weights.
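
A minimal sketch of the idea, assuming PyTorch; the gradient check at the end is only there to show that backpropagation reaches μ and σ despite the sampling:

    import torch

    def reparameterize(mu, logvar):
        """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
        std = torch.exp(0.5 * logvar)   # logvar is log(sigma^2)
        eps = torch.randn_like(std)     # noise drawn outside the gradient path
        return mu + eps * std

    # Gradients reach mu and logvar even though z involves random sampling.
    mu = torch.zeros(4, 32, requires_grad=True)
    logvar = torch.zeros(4, 32, requires_grad=True)
    z = reparameterize(mu, logvar)
    z.sum().backward()
    print(mu.grad.shape, logvar.grad.shape)   # torch.Size([4, 32]) twice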

Loss Function

The VAE is trained to minimize:

L = Reconstruction Loss + KL Divergence

Reconstruction Loss

Measures how well the decoder reconstructs the input, typically as a pixel-wise mean squared error (or binary cross-entropy) between the original and reconstructed observation.

KL Divergence

Regularizes the latent space by pushing the encoder's distribution N(μ, σ²) toward a standard normal prior N(0, I). For a diagonal Gaussian encoder this term has the closed form ½ Σ (μ² + σ² − log σ² − 1), summed over the latent dimensions.
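
Putting the two terms together, a minimal sketch of the loss, assuming PyTorch and the ConvVAE sketch above; the beta argument is a hypothetical weighting knob in the spirit of the β-VAE reference, and beta = 1 recovers the standard VAE objective:

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_recon, mu, logvar, beta=1.0):
        # Reconstruction term: how closely the decoder reproduces the input pixels.
        recon = F.mse_loss(x_recon, x, reduction="sum")
        # KL term: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1), the closed form above.
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
        return recon + beta * kl

    # Typical training step with the ConvVAE sketch above:
    # x_recon, mu, logvar = model(x)
    # vae_loss(x, x_recon, mu, logvar).backward()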

Latent Space Properties

A well-trained VAE produces a latent space with desirable properties:

  1. Continuity: Similar inputs map to nearby points
  2. Completeness: Every point in latent space decodes to a valid output
  3. Smoothness: Interpolation between points produces meaningful transitions (see the sketch after this list)
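
The smoothness property can be probed directly by walking a straight line between the latent codes of two frames and decoding each point. A minimal sketch, assuming the ConvVAE from earlier and two preprocessed frames x_a and x_b of shape (1, 3, 64, 64); the function name interpolate is illustrative:

    import torch

    def interpolate(model, x_a, x_b, steps=8):
        """Decode evenly spaced points on the line between two latent codes."""
        with torch.no_grad():
            mu_a, _ = model.encode(x_a)
            mu_b, _ = model.encode(x_b)
            frames = []
            for t in torch.linspace(0.0, 1.0, steps):
                z = (1.0 - t) * mu_a + t * mu_b   # point between the two codes
                frames.append(model.decode(z))
        return torch.cat(frames)                  # (steps, 3, 64, 64)

In a well-trained model the decoded frames should morph gradually from one scene to the other rather than jumping abruptly.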

References

Academic papers and resources

  • Diederik P. Kingma and Max Welling (2013). Auto-Encoding Variational Bayes.
  • Irina Higgins et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.