Abstract
Predictive Video VAE combines predictive learning with video reconstruction to improve latent space representation and generative performance through temporal coherence and motion priors.
Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only the partial past observations, while training the decoder to reconstruct the observed frames and predict the future ones simultaneously. This design encourages the latent space to encode temporally predictive structure and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior video generation performance, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
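To make the training recipe concrete, the sketch below is one minimal PyTorch-style reading of the predictive reconstruction objective described in the abstract. The encoder/decoder interfaces (in particular the num_frames argument), the L1 losses, and the min_keep ratio are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def predictive_reconstruction_loss(encoder, decoder, video, min_keep=0.5):
    # video: (B, C, T, H, W) clip of T frames, assuming T > 1
    T = video.shape[2]
    # randomly pick how many leading frames the encoder may observe:
    # at least min_keep * T, and always fewer than T so some future
    # frames are discarded and must be predicted
    t_keep = torch.randint(max(1, int(min_keep * T)), T, (1,)).item()
    observed = video[:, :, :t_keep]          # partial past observations
    z = encoder(observed)                    # compact spatiotemporal latent
    # the decoder reconstructs the observed frames and predicts the future ones
    output = decoder(z, num_frames=T)
    loss_recon = F.l1_loss(output[:, :, :t_keep], observed)
    loss_pred = F.l1_loss(output[:, :, t_keep:], video[:, :, t_keep:])
    return loss_recon + loss_pred

Under this reading, the same decoder call serves both roles, so the latent z is pushed to carry enough temporal structure to extrapolate beyond the observed prefix rather than merely compress it.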
Community
the most interesting nugget for me is how they fuse predictive learning with a partial-context reconstruction instead of pure pixel loss. they randomly discard future frames and train the decoder to reconstruct the observed frames while predicting the future ones, and they add a temporal-difference reconstruction to push motion priors. this motion-aware term blocks a degenerate shortcut where the model just copies static content, which is a classic pitfall in long-horizon video VAEs. the arxivlens breakdown helped me parse the method details, worth a quick skim here: https://arxivlens.com/PaperView/Details/video-generation-with-predictive-latents-7120-227e313f. curious how sensitive the results are to the balance between predictive loss and reconstruction loss, and whether the motion term scales with longer horizons?
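for reference, the temporal-difference term i have in mind would look something like this (my own sketch of the idea, not their code; the L1 choice and the shapes are guesses):

import torch.nn.functional as F

def temporal_difference_loss(pred, target):
    # pred, target: (B, C, T, H, W) decoded and ground-truth clips
    # penalize errors in frame-to-frame changes, so copying one static
    # frame across time is no longer a free shortcut
    pred_diff = pred[:, :, 1:] - pred[:, :, :-1]
    target_diff = target[:, :, 1:] - target[:, :, :-1]
    return F.l1_loss(pred_diff, target_diff)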
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Latent-Compressed Variational Autoencoder for Video Diffusion Models (2026)
- Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction (2026)
- RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing (2026)
- HMPDM: A Diffusion Model for Driving Video Prediction with Historical Motion Priors (2026)
- Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting (2026)
- Diffusion Models for Joint Audio-Video Generation (2026)
- TrajLoom: Dense Future Trajectory Generation from Video (2026)