UnifoLM-VLA-0
Project Page | Models | Datasets
🌎English | 🇨🇳中文
UnifoLM-VLA-0 is a Vision–Language–Action (VLA) large model in the UnifoLM series, designed for general-purpose humanoid robot manipulation. It moves beyond the limitations of conventional Vision–Language Models (VLMs) in physical interaction: through continued pre-training on robot manipulation data, the model evolves from "vision-language understanding" into an "embodied brain" equipped with physical common sense.
| Spatial Semantic Enhancement | Manipulation Generalization |
|---|---|
| To meet the instruction-comprehension and spatial-understanding demands of manipulation tasks, continued pre-training deeply integrates textual instructions with 2D/3D spatial details, substantially strengthening the model's spatial perception and geometric understanding. | By leveraging full dynamics-prediction data, the model generalizes across diverse manipulation tasks: in real-robot validation, a single policy completes 12 categories of complex manipulation tasks with high quality. |
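
For reference, below is a minimal sketch of fetching the released checkpoints from the Hugging Face Hub. The repository id `unitreerobotics/UnifoLM-VLA-0` and the local directory are assumptions for illustration; check the Models link above for the authoritative repository path.

```python
# Minimal sketch: download the UnifoLM-VLA-0 checkpoints from the
# Hugging Face Hub. The repo id below is an assumption -- consult the
# Models link above for the actual repository path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="unitreerobotics/UnifoLM-VLA-0",  # assumed repo id
    local_dir="./UnifoLM-VLA-0",              # where to place the weights
)
print(f"Checkpoints available at: {local_dir}")
```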
```bibtex
@misc{unifolm-vla-0,
  author = {Unitree},
  title  = {UnifoLM-VLA-0: A Vision-Language-Action (VLA) Framework under UnifoLM Family},
  year   = {2026},
}
```
The model is released under the CC BY-NC-SA 4.0 license, as found in the LICENSE file. You are responsible for ensuring that your use of Unitree AI Models complies with all applicable laws.