Vision Model: Variational Autoencoder (VAE)

Learn how VAEs compress visual observations into compact latent representations.

Overview

The Vision Model in World Models uses a Variational Autoencoder (VAE) to compress high-dimensional visual observations into a compact latent space.

Why Use a VAE?

Raw visual observations are:

  • High-dimensional: Computationally expensive to process
  • Redundant: Many pixels contain similar information
  • Noisy: Not all visual details are relevant for decision-making

A VAE addresses these issues by learning to do three things, sketched in code after this list:

  1. Encode observations into a low-dimensional latent space
  2. Decode latent vectors back to reconstructed observations
  3. Regularize the latent space for smooth interpolation
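
A minimal sketch of such a model, assuming PyTorch, 64×64 RGB frames, and a 32-dimensional latent vector; the class name ConvVAE and the exact layer sizes are illustrative choices rather than a definitive implementation:

    import torch
    import torch.nn as nn

    class ConvVAE(nn.Module):
        """Minimal convolutional VAE: encode, sample, decode."""

        def __init__(self, latent_dim=32):
            super().__init__()
            # Encoder: 3x64x64 frame -> flat feature vector.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
                nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),
                nn.Flatten(),                                # -> 256 * 2 * 2 = 1024 features
            )
            self.fc_mu = nn.Linear(1024, latent_dim)         # mean of q(z|x)
            self.fc_logvar = nn.Linear(1024, latent_dim)     # log-variance of q(z|x)
            # Decoder: latent vector -> reconstructed 3x64x64 frame.
            self.fc_dec = nn.Linear(latent_dim, 1024)
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 6, stride=2), nn.Sigmoid(),  # pixels in [0, 1]
            )

        def encode(self, x):
            h = self.encoder(x)
            return self.fc_mu(h), self.fc_logvar(h)

        def reparameterize(self, mu, logvar):
            # Explained in "The Reparameterization Trick" below.
            std = torch.exp(0.5 * logvar)
            eps = torch.randn_like(std)
            return mu + eps * std

        def decode(self, z):
            h = self.fc_dec(z).view(-1, 1024, 1, 1)
            return self.decoder(h)

        def forward(self, x):
            mu, logvar = self.encode(x)
            z = self.reparameterize(mu, logvar)   # sample a latent code
            return self.decode(z), mu, logvar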

The Reparameterization Trick

A key innovation in VAEs is the reparameterization trick, which enables backpropagation through the stochastic sampling step. Instead of sampling z directly from N(μ, σ²), the encoder outputs μ and σ, an auxiliary noise variable ε is drawn from N(0, I), and the latent code is computed as z = μ + σ ⊙ ε. Because the randomness is isolated in ε, gradients can flow through μ and σ back to the encoder's weights.
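
A minimal sketch of the idea, assuming PyTorch; the gradient check at the end is only there to show that backpropagation reaches μ and σ despite the sampling:

    import torch

    def reparameterize(mu, logvar):
        """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
        std = torch.exp(0.5 * logvar)   # logvar is log(sigma^2)
        eps = torch.randn_like(std)     # noise drawn outside the gradient path
        return mu + eps * std

    # Gradients reach mu and logvar even though z involves random sampling.
    mu = torch.zeros(4, 32, requires_grad=True)
    logvar = torch.zeros(4, 32, requires_grad=True)
    z = reparameterize(mu, logvar)
    z.sum().backward()
    print(mu.grad.shape, logvar.grad.shape)   # torch.Size([4, 32]) twice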

Loss Function

The VAE is trained to minimize:

L = Reconstruction Loss + KL Divergence

Reconstruction Loss

Measures how well the decoder reconstructs the input, typically as a pixel-wise mean squared error (or binary cross-entropy) between the original and reconstructed observation.

KL Divergence

Regularizes the latent space by pushing the encoder's distribution N(μ, σ²) toward a standard normal prior N(0, I). For a diagonal Gaussian encoder this term has the closed form ½ Σ (μ² + σ² − log σ² − 1), summed over the latent dimensions.
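
Putting the two terms together, a minimal sketch of the loss, assuming PyTorch and the ConvVAE sketch above; the beta argument is a hypothetical weighting knob in the spirit of the β-VAE reference, and beta = 1 recovers the standard VAE objective:

    import torch
    import torch.nn.functional as F

    def vae_loss(x, x_recon, mu, logvar, beta=1.0):
        # Reconstruction term: how closely the decoder reproduces the input pixels.
        recon = F.mse_loss(x_recon, x, reduction="sum")
        # KL term: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1), the closed form above.
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
        return recon + beta * kl

    # Typical training step with the ConvVAE sketch above:
    # x_recon, mu, logvar = model(x)
    # vae_loss(x, x_recon, mu, logvar).backward()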

Latent Space Properties

A well-trained VAE produces a latent space with desirable properties:

  1. Continuity: Similar inputs map to nearby points
  2. Completeness: Every point in latent space decodes to a valid output
  3. Smoothness: Interpolation between points produces meaningful transitions (see the sketch after this list)
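
The smoothness property can be probed directly by walking a straight line between the latent codes of two frames and decoding each point. A minimal sketch, assuming the ConvVAE from earlier and two preprocessed frames x_a and x_b of shape (1, 3, 64, 64); the function name interpolate is illustrative:

    import torch

    def interpolate(model, x_a, x_b, steps=8):
        """Decode evenly spaced points on the line between two latent codes."""
        with torch.no_grad():
            mu_a, _ = model.encode(x_a)
            mu_b, _ = model.encode(x_b)
            frames = []
            for t in torch.linspace(0.0, 1.0, steps):
                z = (1.0 - t) * mu_a + t * mu_b   # point between the two codes
                frames.append(model.decode(z))
        return torch.cat(frames)                  # (steps, 3, 64, 64)

In a well-trained model the decoded frames should morph gradually from one scene to the other rather than jumping abruptly.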

References

Academic papers and resources

  • Diederik P. Kingma and Max Welling (2013). Auto-Encoding Variational Bayes.
  • Irina Higgins et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.