UnifoLM-VLA-0: A Vision-Language-Action (VLA) Framework under UnifoLM Family

Project Page | Models | Datasets

🌎English | 🇨🇳中文

UnifoLM-VLA-0 is a Vision–Language–Action (VLA) large model in the UnifoLM series, designed for general-purpose humanoid robot manipulation. It goes beyond the limitations of conventional Vision–Language Models (VLMs) in physical interaction. Through continued pre-training on robot manipulation data, the model evolves from "vision-language understanding" to an "embodied brain" equipped with physical common sense.

Spatial Semantic Enhancement
To address the instruction comprehension and spatial understanding required by manipulation tasks, continued pre-training deeply integrates textual instructions with 2D/3D spatial details, substantially strengthening the model's spatial perception and geometric understanding.

Manipulation Generalization
By leveraging full dynamics prediction data, the model generalizes strongly across diverse manipulation tasks: in real-robot validation, a single policy completes 12 categories of complex manipulation tasks with high quality.
[UnifoLM-VLA demo video]
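
The card itself does not include inference code, but a VLA policy of this kind is typically driven by a simple closed loop: feed the current camera frame and a language instruction to the model, receive a short chunk of predicted actions, execute them on the robot, and repeat. The sketch below illustrates that loop only; the `VLAPolicy` class, its observation keys, and the action format are hypothetical placeholders, not the actual UnifoLM-VLA-0 API.

```python
# Minimal sketch of a generic VLA control loop.
# NOTE: VLAPolicy, its observation format, and the action shape are
# hypothetical placeholders, not the real UnifoLM-VLA-0 interface.
from dataclasses import dataclass
import numpy as np


@dataclass
class VLAPolicy:
    """Placeholder for a loaded VLA checkpoint (hypothetical)."""
    action_dim: int = 7   # e.g. 6-DoF end-effector delta + gripper
    chunk_size: int = 8   # number of future steps predicted per call

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real policy would run the vision-language-action model here;
        # this stub just returns zero actions of the expected shape.
        return np.zeros((self.chunk_size, self.action_dim), dtype=np.float32)


def control_loop(policy, get_frame, send_action, instruction, steps=100):
    """Closed-loop execution: observe, predict an action chunk, execute, repeat."""
    executed = 0
    while executed < steps:
        frame = get_frame()                  # HxWx3 RGB image from the robot camera
        chunk = policy.predict(frame, instruction)
        for action in chunk:                 # execute the predicted chunk open-loop
            send_action(action)
            executed += 1
            if executed >= steps:
                break


if __name__ == "__main__":
    policy = VLAPolicy()
    dummy_frame = lambda: np.zeros((224, 224, 3), dtype=np.uint8)
    log_action = lambda a: print("action:", np.round(a, 3))
    control_loop(policy, dummy_frame, log_action, "pick up the red cup", steps=4)
```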

📝 Citation

@misc{unifolm-vla-0,
  author       = {Unitree},
  title        = {UnifoLM-VLA-0: A Vision-Language-Action (VLA) Framework under UnifoLM Family},
  year         = {2026},
}

License

The model is released under the CC BY-NC-SA 4.0 license, as found in the LICENSE file. You are responsible for ensuring that your use of Unitree AI models complies with all applicable laws.
