Abstract
Predictive Video VAE combines predictive learning with video reconstruction to improve latent space representation and generative performance through temporal coherence and motion priors.
Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only the partial past observations, while training the decoder to reconstruct the observed frames and predict the future ones simultaneously. This design encourages the latent space to encode temporally predictive structure and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior video generation performance, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
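To make the training recipe concrete, the sketch below is one minimal PyTorch-style reading of the predictive reconstruction objective described in the abstract. The encoder/decoder interfaces (in particular the num_frames argument), the L1 losses, and the min_keep ratio are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def predictive_reconstruction_loss(encoder, decoder, video, min_keep=0.5):
    # video: (B, C, T, H, W) clip of T frames, assuming T > 1
    T = video.shape[2]
    # randomly pick how many leading frames the encoder may observe:
    # at least min_keep * T, and always fewer than T so some future
    # frames are discarded and must be predicted
    t_keep = torch.randint(max(1, int(min_keep * T)), T, (1,)).item()
    observed = video[:, :, :t_keep]          # partial past observations
    z = encoder(observed)                    # compact spatiotemporal latent
    # the decoder reconstructs the observed frames and predicts the future ones
    output = decoder(z, num_frames=T)
    loss_recon = F.l1_loss(output[:, :, :t_keep], observed)
    loss_pred = F.l1_loss(output[:, :, t_keep:], video[:, :, t_keep:])
    return loss_recon + loss_pred

Under this reading, the same decoder call serves both roles, so the latent z is pushed to carry enough temporal structure to extrapolate beyond the observed prefix rather than merely compress it.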
Community
the most interesting nugget for me is how they fuse predictive learning with a partial-context reconstruction instead of pure pixel loss. they randomly discard future frames and train the decoder to reconstruct the observed frames while predicting the future ones, and they add a temporal-difference reconstruction to push motion priors. this motion-aware term blocks a degenerate shortcut where the model just copies static content, which is a classic pitfall in long-horizon video VAEs. the arxivlens breakdown helped me parse the method details, worth a quick skim here: https://arxivlens.com/PaperView/Details/video-generation-with-predictive-latents-7120-227e313f. curious how sensitive the results are to the balance between predictive loss and reconstruction loss, and whether the motion term scales with longer horizons?
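for reference, the temporal-difference term i have in mind would look something like this (my own sketch of the idea, not their code; the L1 choice and the shapes are guesses):

import torch.nn.functional as F

def temporal_difference_loss(pred, target):
    # pred, target: (B, C, T, H, W) decoded and ground-truth clips
    # penalize errors in frame-to-frame changes, so copying one static
    # frame across time is no longer a free shortcut
    pred_diff = pred[:, :, 1:] - pred[:, :, :-1]
    target_diff = target[:, :, 1:] - target[:, :, :-1]
    return F.l1_loss(pred_diff, target_diff)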
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Latent-Compressed Variational Autoencoder for Video Diffusion Models (2026)
- Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction (2026)
- RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing (2026)
- HMPDM: A Diffusion Model for Driving Video Prediction with Historical Motion Priors (2026)
- Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting (2026)
- Diffusion Models for Joint Audio-Video Generation (2026)
- TrajLoom: Dense Future Trajectory Generation from Video (2026)