---
language:
- en
license: mit
library_name: transformers
pipeline_tag: other
tags:
- robotics
- navigation
- embodied-ai
- waypoint-prediction
- qwen
model_name: OpenTrackVLA Qwen0.6B Planner
---

# OpenTrackVLA 🤖 👀

**Visual Navigation & Following for Everyone.**

[License](https://opensource.org/licenses/Apache-2.0) · [arXiv:2509.12129](https://arxiv.org/abs/2509.12129)

**OpenTrackVLA** is a fully open-source Vision-Language-Action (VLA) stack that turns **monocular video** and **natural-language instructions** into actionable, short-horizon waypoints. While we explore massive backbones (8B/30B) internally, this repository is dedicated to democratizing embodied AI: we have intentionally released our highly efficient **0.6B checkpoint** along with the **full training pipeline**.

### 🚀 Why OpenTrackVLA?

* **Fully Open Source:** We release the model weights, the inference code, *and* the training stack, not just an inference wrapper.
* **Accessible:** Designed to be reproduced, fine-tuned, and deployed on affordable compute.
* **Multimodal Control:** Combines learned priors with visual input to guide real or simulated robots via simple text prompts.

> **Acknowledgment:** OpenTrackVLA builds on the ideas introduced by the original [TrackVLA project](https://github.com/wsakobe/TrackVLA). Their partially open release inspired this community-driven effort to keep the ecosystem open so researchers and developers can continue improving the stack together.

## Demo In Action

The system processes a video history and a text instruction to predict future waypoints. A rough sketch of that interface is given below, followed by examples of the tracker in action.
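The snippet below is only illustrative: the repository id, the processor's input keys, and the output format are assumptions, not the confirmed OpenTrackVLA API; the repository's own inference scripts remain the authoritative entry point. It assumes the checkpoint exposes a standard `transformers` remote-code interface.

```python
# Hypothetical inference sketch. The repo id, frame paths, processor inputs, and
# output format are placeholders/assumptions, not the confirmed OpenTrackVLA API.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "OpenTrackVLA/opentrackvla-qwen-0.6b"  # placeholder Hub id (assumption)

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

# A short history of monocular frames plus a natural-language instruction.
frames = [Image.open(f"frames/{i:03d}.jpg") for i in range(8)]  # placeholder frame files
instruction = "Follow the person in the red jacket."

inputs = processor(images=frames, text=instruction, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# The planner is assumed to emit short-horizon waypoints as text, which a robot
# controller would then parse into relative (x, y) targets.
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```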