Update License to MIT, Add Paper Abstract, and Enhance Usage Section

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +152 -99
README.md CHANGED
@@ -1,99 +1,152 @@
1
- ---
2
- pipeline_tag: robotics
3
- library_name: transformers
4
- license: cc-by-nc-sa-4.0
5
- tags:
6
- - vision-language-model
7
- - manipulation
8
- - robotics
9
- ---
10
-
11
- <div align="center">
12
- <video src="https://cdn-uploads.huggingface.co/production/uploads/678123194248fde89e4fc9bf/_cbIWKHPzffRxIpfmqdFG.mp4"
13
- controls autoplay muted playsinline loop width="720"></video>
14
-
15
- <p><em>🏁 Best viewed with sound on</em></p>
16
- </div>
17
-
18
-
19
- # F1: A Vision Language Action Model Bridging<br>Understanding and Generation to Actions
20
- [![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://arxiv.org/abs/2509.06951)
21
- [![Code](https://img.shields.io/badge/GitHub-Code-800820?logo=github)](https://github.com/InternRobotics/F1-VLA)
22
- [![Website](https://img.shields.io/badge/Website-Pages-blue.svg)](https://aopolin-lv.github.io/F1-VLA)
23
-
24
-
25
-
26
- ## πŸš€ Key Innovations
27
-
28
- - **🧠 Predictive Inverse Dynamics**: Visual foresight generation for planning-based control
29
- - **πŸ—οΈ Mixture-of-Transformer**: Three specialized experts (Understanding, Generation, Action)
30
- - **πŸ“ˆ Three-Stage Training**: Progressive alignment, pretraining, and adaptation
31
-
32
- ## πŸ€– Real-World Robot Experiments
33
-
34
- <!-- <div align="center">
35
- <video src="https://cdn-uploads.huggingface.co/production/uploads/678123194248fde89e4fc9bf/FPZ45NJd9_B_T1gOP8QVf.qt"
36
- controls autoplay muted playsinline loop width="720"></video>
37
- <p><em>9 diverse manipulation tasks including pick-and-place, handover, and complex object manipulation</em></p>
38
- </div> -->
39
-
40
- <div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
41
- <!-- First Row -->
42
- <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
43
- <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
44
- <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v2_long.mp4" type="video/mp4">
45
- </video>
46
- <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
47
- <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v1_dyna.mp4" type="video/mp4">
48
- </video>
49
- <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
50
- <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/franka_v1_sweep.mp4" type="video/mp4">
51
- </video>
52
- </div>
53
- <!-- Second Row -->
54
- <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
55
- <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
56
- <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v2_handover.mp4" type="video/mp4">
57
- </video>
58
- <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
59
- <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v3_tea.mp4" type="video/mp4">
60
- </video>
61
- <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
62
- <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v1_flower.mp4" type="video/mp4">
63
- </video>
64
- </div>
65
- <p><em>Diverse manipulation tasks across multiple robot platforms.</em></p>
66
- </div>
67
-
68
-
69
- ## πŸ“Š Performance Summary
70
-
71
- | Task | Platform | F1 | Ο€0 | Improvement |
72
- |:--------:|:------------:|:------------------:|:------------:|:---------------:|
73
- | Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
74
- | Adaptation | Franka | 66.7% | 53.3% | +13.4% |
75
- | Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
76
- | Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
77
-
78
- ## Usage
79
- Please refer to our official repo [F1-VLA](https://github.com/InternRobotics/F1-VLA).
80
-
81
- ## πŸ“š Citation
82
-
83
- If you find our work helpful, please cite:
84
-
85
- ```bibtex
86
- @article{f1_vla_2025,
87
- title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
88
- author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
89
- journal={Conference/Journal Name},
90
- year={2025},
91
- url={https://arxiv.org/abs/2509.06951}
92
- }
93
- ```
94
-
95
- ## License
96
- This work is under the [cc-by-nc-sa-4.0](LICENSE).
97
-
98
- ## Acknowledgements
99
- This repository is based on [Lerobot](https://github.com/huggingface/lerobot), [Any4lerobot](https://github.com/Tavish9/any4lerobot/), and [VAR](https://github.com/FoundationVision/VAR).
1
+ ---
2
+ library_name: transformers
3
+ license: mit
4
+ pipeline_tag: robotics
5
+ tags:
6
+ - vision-language-model
7
+ - manipulation
8
+ - robotics
9
+ ---
10
+
11
+ <div align="center">
12
+ <video src="https://cdn-uploads.huggingface.co/production/uploads/678123194248fde89e4fc9bf/_cbIWKHPzffRxIpfmqdFG.mp4"
13
+ controls autoplay muted playsinline loop width="720"></video>
14
+
15
+ <p><em>🏁 Best viewed with sound on</em></p>
16
+ </div>
17
+
18
+
19
+ # F1: A Vision-Language-Action Model Bridging<br>Understanding and Generation to Actions
20
+ [![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://arxiv.org/abs/2509.06951)
21
+ [![Code](https://img.shields.io/badge/GitHub-Code-800820?logo=github)](https://github.com/InternRobotics/F1-VLA)
22
+ [![Website](https://img.shields.io/badge/Website-Pages-blue.svg)](https://aopolin-lv.github.io/F1-VLA)
23
+
24
+ ### Abstract
25
+ Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework that integrates visual foresight generation into the decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate that F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.
26
+
27
+
28
+ ## πŸš€ Key Innovations
29
+
30
+ - **🧠 Predictive Inverse Dynamics**: Visual foresight generation for planning-based control
31
+ - **πŸ—οΈ Mixture-of-Transformer**: Three specialized experts (Understanding, Generation, Action)
32
+ - **πŸ“ˆ Three-Stage Training**: Progressive alignment, pretraining, and adaptation
33
+
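+ The interplay of the three experts can be pictured with a minimal, self-contained PyTorch sketch. This is **not** the released F1 implementation: every module name, size, and interface below is an illustrative assumption based on the description above, and the VAR-style next-scale image synthesis is abstracted into a single expert call.
+
+ ```python
+ # Conceptual sketch only -- NOT the official F1 code. All names and shapes are
+ # illustrative assumptions for the three-expert, foresight-guided design.
+ import torch
+ import torch.nn as nn
+
+
+ class Expert(nn.Module):
+     """A small Transformer encoder standing in for one specialized expert."""
+
+     def __init__(self, dim: int = 256, depth: int = 2, heads: int = 4):
+         super().__init__()
+         layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
+         self.encoder = nn.TransformerEncoder(layer, depth)
+
+     def forward(self, tokens: torch.Tensor) -> torch.Tensor:
+         return self.encoder(tokens)
+
+
+ class ForesightGuidedPolicy(nn.Module):
+     """Understanding -> foresight generation -> foresight-guided inverse dynamics."""
+
+     def __init__(self, dim: int = 256, action_dim: int = 7, horizon: int = 16):
+         super().__init__()
+         self.understanding = Expert(dim)  # fuses visual observation + language tokens
+         self.generation = Expert(dim)     # predicts future (goal) visual tokens
+         self.action = Expert(dim)         # inverse dynamics over (current, foresight)
+         self.action_head = nn.Linear(dim, action_dim)
+         self.horizon = horizon
+
+     def forward(self, obs_tokens: torch.Tensor, lang_tokens: torch.Tensor) -> torch.Tensor:
+         # 1) Perceive: jointly encode the current observation and the instruction.
+         context = self.understanding(torch.cat([obs_tokens, lang_tokens], dim=1))
+         # 2) Foresee: synthesize goal-conditioned future visual tokens from context.
+         foresight = self.generation(context)
+         # 3) Act: condition on (current observation, predicted foresight) and decode an
+         #    action chunk, treating control as a foresight-guided inverse dynamics problem.
+         fused = self.action(torch.cat([obs_tokens, foresight], dim=1))
+         return self.action_head(fused[:, : self.horizon])  # (B, horizon, action_dim)
+
+
+ if __name__ == "__main__":
+     policy = ForesightGuidedPolicy()
+     obs = torch.randn(1, 64, 256)   # e.g. 64 visual tokens
+     lang = torch.randn(1, 16, 256)  # e.g. 16 instruction tokens
+     print(policy(obs, lang).shape)  # torch.Size([1, 16, 7])
+ ```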
34
+ ## πŸ€– Real-World Robot Experiments
35
+
36
+ <!-- <div align="center">
37
+ <video src="https://cdn-uploads.huggingface.co/production/uploads/678123194248fde89e4fc9bf/FPZ45NJd9_B_T1gOP8QVf.qt"
38
+ controls autoplay muted playsinline loop width="720"></video>
39
+ <p><em>9 diverse manipulation tasks including pick-and-place, handover, and complex object manipulation</em></p>
40
+ </div> -->
41
+
42
+ <div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
43
+ <!-- First Row -->
44
+ <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
45
+ <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
46
+ <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v2_long.mp4" type="video/mp4">
47
+ </video>
48
+ <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
49
+ <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v1_dyna.mp4" type="video/mp4">
50
+ </video>
51
+ <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
52
+ <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/franka_v1_sweep.mp4" type="video/mp4">
53
+ </video>
54
+ </div>
55
+ <!-- Second Row -->
56
+ <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
57
+ <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
58
+ <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v2_handover.mp4" type="video/mp4">
59
+ </video>
60
+ <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
61
+ <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v3_tea.mp4" type="video/mp4">
62
+ </video>
63
+ <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
64
+ <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v1_flower.mp4" type="video/mp4">
65
+ </video>
66
+ </div>
67
+ <p><em>Diverse manipulation tasks across multiple robot platforms.</em></p>
68
+ </div>
69
+
70
+
71
+ ## πŸ“Š Performance Summary
72
+
73
+ | Task | Platform | F1 | Ο€0 | Improvement |
74
+ |:--------:|:------------:|:------------------:|:------------:|:---------------:|
75
+ | Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
76
+ | Adaptation | Franka | 66.7% | 53.3% | +13.4% |
77
+ | Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
78
+ | Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
79
+
80
+ ## Usage
81
+ ### Prerequisites
82
+ - Python β‰₯ 3.10
83
+ - torch β‰₯ 2.6.0
84
+ - CUDA β‰₯ 12.4 (a quick check is sketched below)
85
+
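+ Once the environment below is set up, a quick sanity check against these prerequisites can save debugging time. A minimal sketch (the authoritative requirements are the list above, not this script):
+
+ ```python
+ # Optional environment check for the prerequisites listed above.
+ import sys
+
+ import torch
+
+ print("python :", sys.version.split()[0])
+ print("torch  :", torch.__version__)
+ print("cuda   :", torch.version.cuda, "| available:", torch.cuda.is_available())
+
+ assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
+ ```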
86
+ ### Installation
87
+ ```bash
88
+ # Clone repository
89
+ git clone https://github.com/InternRobotics/F1-VLA.git
90
+ cd F1-VLA
+ export VLA_HOME=$(pwd)
91
+ cd f1_vla
92
+
93
+ # Create environment
94
+ conda create -n f1_vla python=3.10
95
+ conda activate f1_vla
96
+
97
+ # Install dependencies
98
+ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 torchcodec==0.2.1 --index-url https://download.pytorch.org/whl/cu124
99
+
100
+ # Install f1_vla in editable mode
101
+ pip install -e .
102
+
103
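+ # Pin NumPy to 1.x (assumed here to avoid NumPy 2.x incompatibilities with the pinned stack)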
+ pip install numpy==1.26.4
104
+ ```
105
+ For optimal performance and compatibility, we highly recommend using [FFmpeg](https://ffmpeg.org/) alongside [TorchCodec](https://github.com/pytorch/torchcodec).
106
+
107
+ - FFmpeg is an industry-standard multimedia framework that provides robust, all-purpose video and audio processing.
108
+ - TorchCodec is a library specifically designed for deep learning workflows in PyTorch, offering highly optimized video I/O.
109
+
110
+ Together, these two tools greatly accelerate video dataset loading; a short availability check is sketched below.
111
+
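+ An optional check that the recommended video stack is actually in place (a sketch; it only verifies availability and does not benchmark decoding speed):
+
+ ```python
+ # Verify that FFmpeg and TorchCodec are available for fast video loading.
+ import importlib.util
+ import shutil
+
+ print("ffmpeg on PATH :", shutil.which("ffmpeg") or "NOT FOUND")
+ for pkg in ("torchcodec", "torchvision"):
+     found = importlib.util.find_spec(pkg) is not None
+     print(f"{pkg:<12}   :", "installed" if found else "MISSING")
+ ```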
112
+ ### Download Datasets and Pretrained Models
113
+
114
+ |**Name**|**Link**|
115
+ |:--|:--|
116
+ |LIBERO_SPATIAL_NO_NOOPS_PATH|[IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot](https://huggingface.co/datasets/IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot)|
117
+ |STAGE2_CKPT_PATH|[F1_pretrain](https://huggingface.co/InternRobotics/F1-VLA)|
118
+ |LEROBOT_PI0_PATH|[lerobot/pi0](https://huggingface.co/lerobot/pi0)|
119
+ |PALIGEMMA_PATH|[google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224)|
120
+ |VAE_PATH|[var_d16.pth](https://huggingface.co/FoundationVision/var/resolve/main/var_d16.pth)|
121
+
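+ These assets can also be fetched programmatically with `huggingface_hub`, for example (a sketch; the local cache locations are whatever your Hugging Face setup uses, and `google/paligemma-3b-pt-224` may require accepting its license and logging in first):
+
+ ```python
+ # Example of downloading the assets from the table above via huggingface_hub.
+ from huggingface_hub import hf_hub_download, snapshot_download
+
+ # STAGE2_CKPT_PATH: pretrained F1 checkpoint
+ stage2_ckpt = snapshot_download(repo_id="InternRobotics/F1-VLA")
+
+ # LIBERO_SPATIAL_NO_NOOPS_PATH: LIBERO-Spatial data in LeRobot format
+ libero_spatial = snapshot_download(
+     repo_id="IPEC-COMMUNITY/libero_spatial_no_noops_1.0.0_lerobot",
+     repo_type="dataset",
+ )
+
+ # LEROBOT_PI0_PATH and PALIGEMMA_PATH: base policy and VLM backbone
+ pi0 = snapshot_download(repo_id="lerobot/pi0")
+ paligemma = snapshot_download(repo_id="google/paligemma-3b-pt-224")
+
+ # VAE_PATH: VAR tokenizer weights (single file)
+ vae_path = hf_hub_download(repo_id="FoundationVision/var", filename="var_d16.pth")
+
+ print(stage2_ckpt, libero_spatial, pi0, paligemma, vae_path, sep="\n")
+ ```
+
+ The resulting local paths are presumably what the corresponding placeholders in the training config should point to.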
122
+ ### Finetune
123
+ ```shell
124
+ # 1. edit config file
125
+ vim f1_vla/config/debug_test.yaml
126
+
127
+ # 2. run the program
128
+ cd "$VLA_HOME"
129
+ python train_hf.py --config-file f1_vla/config/debug_test.yaml
130
+ ```
131
+
132
+
133
+ ## πŸ“š Citation
134
+
135
+ If you find our work helpful, please cite:
136
+
137
+ ```bibtex
138
+ @misc{f1_vla_2025,
139
+ title={F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions},
140
+ author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Michael Yu Wang and Liqiang Nie and Jiangmiao Pang},
141
+ eprint={2509.06951},
142
+ archivePrefix={arXiv},
143
+ year={2025},
144
+ url={https://arxiv.org/abs/2509.06951}
145
+ }
146
+ ```
147
+
148
+ ## License
149
+ This work is licensed under the [MIT License](https://opensource.org/licenses/MIT).
150
+
151
+ ## Acknowledgements
152
+ This repository is based on [Lerobot](https://github.com/huggingface/lerobot), [Any4lerobot](https://github.com/Tavish9/any4lerobot/), and [VAR](https://github.com/FoundationVision/VAR).