Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
Abstract
CompACT, a discrete tokenizer that reduces observation encoding from hundreds to 8 tokens, enables faster and more efficient world model planning for real-time control applications.
World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in the latent representation: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving the information essential for planning. An action-conditioned world model that uses the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
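To give a rough feel for why tokens per observation matter so much at planning time, the sketch below counts query-key pairs in full self-attention over a short rollout. It is a back-of-the-envelope illustration, not the paper's cost model: the 256-token baseline is a hypothetical stand-in for "hundreds of tokens", the 8-frame horizon is likewise assumed, and `attention_pair_count` is a helper named here for illustration only.

```python
def attention_pair_count(tokens_per_frame: int, horizon: int) -> int:
    """Query-key pairs in full self-attention over `horizon` frames."""
    seq_len = tokens_per_frame * horizon
    return seq_len * seq_len


HORIZON = 8            # assumed rollout length in frames (hypothetical)
BASELINE_TOKENS = 256  # hypothetical stand-in for "hundreds of tokens"
COMPACT_TOKENS = 8     # the abstract's "as few as 8 tokens"

baseline = attention_pair_count(BASELINE_TOKENS, HORIZON)
compact = attention_pair_count(COMPACT_TOKENS, HORIZON)
ratio = baseline // compact

print(f"baseline pairs: {baseline}")
print(f"compact pairs:  {compact}")
print(f"ratio:          {ratio}x")
```

Under these assumptions the sequence is 32 times shorter, so the quadratic attention term shrinks by roughly 32 squared. Actual speedups depend on the world model architecture, the planner, and the hardware, none of which this sketch models.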
Community
8-16 tokens per frame for planning, with a frozen DINOv3 backbone and a learnable latent resampler, is wild. i'd like to see how robust that compact latent space is when important but rare cues get compressed away, especially for longer-horizon tasks. the breakdown on arxivlens was solid and helped me sanity-check the token flow while skimming, nice to have a quick walkthrough: https://arxivlens.com/PaperView/Details/planning-in-8-tokens-a-compact-discrete-tokenizer-for-latent-world-model-6795-84bb8360
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Causal World Modeling for Robot Control (2026)
- OAT: Ordered Action Tokenization (2026)
- Recursive Belief Vision Language Action Models (2026)
- Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation (2026)
- LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers (2026)
- FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution (2026)
- Scaling World Model for Hierarchical Manipulation Policies (2026)

