---
pipeline_tag: robotics
library_name: transformers
license: cc-by-nc-sa-4.0
tags:
  - vision-language-model
  - navigation
---

# InternVLA-N1 Model Series

This model was presented in the paper [Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation](https://huggingface.co/papers/2512.08186).

![License](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)
![Transformers](https://img.shields.io/badge/%F0%9F%A4%97%20Transformers-9cf?style=flat)
![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?logo=pytorch&logoColor=white)

---

## Model Description
InternVLA-N1 is a state-of-the-art navigation foundation model built on a **multi-system design**. Within this framework, it adopts a **dual-system approach** that jointly trains **System 2** for high-level reasoning and **System 1** for low-level action and control. This asynchronous architecture enables smooth, efficient, and robust instruction-following navigation in both simulated and real-world environments.
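
The split can be pictured as two loops running at different rates: System 2 periodically grounds the instruction into an intermediate goal (e.g., a pixel goal on the current view), while System 1 tracks the most recent goal at a much higher control frequency. The sketch below is illustrative only and not code from this repository; `ground_goal`, `predict_action`, `get_observation`, and `send_command` are hypothetical placeholders.

```python
import threading
import time

latest_goal = None            # shared slot: written by System 2, read by System 1
goal_lock = threading.Lock()

def system2_loop(get_observation, instruction, ground_goal, hz=1.0):
    """Slow loop: re-ground the language instruction into an intermediate goal."""
    global latest_goal
    while True:
        goal = ground_goal(get_observation(), instruction)   # hypothetical VLM call
        with goal_lock:
            latest_goal = goal
        time.sleep(1.0 / hz)

def system1_loop(get_observation, predict_action, send_command, hz=30.0):
    """Fast loop: track whatever goal System 2 produced most recently."""
    while True:
        with goal_lock:
            goal = latest_goal
        if goal is not None:
            send_command(predict_action(get_observation(), goal))   # hypothetical policy call
        time.sleep(1.0 / hz)
```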


---

### πŸ”— Resources

[![Code](https://img.shields.io/badge/GitHub-InternNav-181717?logo=github)](https://github.com/InternRobotics/InternNav)
[![Technical Report - InternVLA-N1](https://img.shields.io/badge/Technical_Report-InternVLA--N1-BB2649?logo=adobeacrobatreader&logoColor=white)](https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLA_N1.pdf)
[![DualVLN Paper - arXiv](https://img.shields.io/badge/arXiv-DualVLN-B31B1B?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2512.08186)
[![Project Page - InternVLA-N1](https://img.shields.io/badge/Project_Page-InternVLA--N1-4285F4?logo=google-chrome&logoColor=white)](https://internrobotics.github.io/internvla-n1.github.io/)
[![Project Page - DualVLN](https://img.shields.io/badge/Project_Page-DualVLN-4285F4?logo=google-chrome&logoColor=white)](https://internrobotics.github.io/internvla-n1-dualvln.github.io/)
[![Dataset](https://img.shields.io/badge/Dataset-InternData--N1-FF6F00?logo=huggingface&logoColor=white)](https://huggingface.co/datasets/InternRobotics/InternData-N1)

---

## Key Features

- 🧩 **Modular Multi-System Support**  
  Combines **System 2** (reasoning/planning) with **System 1** (action/control) in an asynchronous framework, delivering the first **Dual-System Vision-Language Navigation (VLN) Foundation Model**.

- 🚀 **Zero-Shot Sim2Real Generalization**  
  Trained exclusively on simulation data (**InternData-N1**) while generalizing effectively to real-world deployments.

- 🏆 **State-of-the-Art Performance**  
  Achieves leading results on multiple VLN benchmarks, including **VLN-CE R2R/RxR** and **VLN-PE**.

- ⚡ **Asynchronous Inference**  
  Enables smooth execution and dynamic obstacle avoidance during navigation.


---

## Model Variants

| Model Variant | Description | Key Characteristics |
|--------------|-------------|----------------------|
| [**InternVLA-N1 (S2)**](https://huggingface.co/InternRobotics/InternVLA-N1-System2) | Finetuned Qwen2.5-VL model for pixel-goal grounding | Strong System 2 module; compatible with decoupled System 1 controllers or joint optimization pipelines |
| [**InternVLA-N1 (Dual System) _w/ NavDP\*_**](https://huggingface.co/InternRobotics/InternVLA-N1-w-NavDP) | Jointly tuned System 1 (NavDP\*) and InternVLA-N1 (S2) | Optimized end-to-end performance; uses RGB-D observations |
| [**InternVLA-N1 (Dual System) _DualVLN_**](https://huggingface.co/InternRobotics/InternVLA-N1-DualVLN) | Latest dual-system architecture | Optimized end-to-end performance and faster convergence; uses RGB observations |


> The previously released version is now called [InternVLA-N1-wo-dagger](https://huggingface.co/InternRobotics/InternVLA-N1-wo-dagger). The latest official release is recommended for best performance.
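
All variants are hosted on the Hugging Face Hub under the repository IDs linked in the table above. As a minimal illustration (not part of the InternNav API), a config key could be mapped to the corresponding repo ID like this:

```python
# Illustrative mapping from variant names (from the table above) to Hub repository IDs.
VARIANTS = {
    "s2": "InternRobotics/InternVLA-N1-System2",            # System 2 only: pixel-goal grounding
    "dual-navdp": "InternRobotics/InternVLA-N1-w-NavDP",    # dual system, RGB-D observations
    "dual-dualvln": "InternRobotics/InternVLA-N1-DualVLN",  # dual system, RGB observations
}

def resolve_variant(name: str = "dual-dualvln") -> str:
    """Return the Hub repo ID for a variant; `name` is a hypothetical config key."""
    return VARIANTS[name]
```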

---

## Sample Usage

This model is compatible with the Hugging Face `transformers` library. The following code snippet demonstrates how to perform inference:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
import requests
from io import BytesIO

# Load model and processor
hf_model_id = "InternRobotics/InternVLA-N1-DualVLN"
model = AutoModelForCausalLM.from_pretrained(hf_model_id, torch_dtype=torch.float16, trust_remote_code=True, device_map="cuda")
processor = AutoProcessor.from_pretrained(hf_model_id, trust_remote_code=True)

# Load a dummy image
# Replace with your actual image path or a URL to a relevant scene
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird_image.jpg"
image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")

# Define a question related to navigation or visual understanding
question = "What is the most direct path to the kitchen from here? Describe the first few steps."

messages = [
    {"role": "user", "content": f"<|image_pad|>{question}"},
]

# Tokenize the chat template (return_dict=True so the encoding can be unpacked into generate())
inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
inputs = inputs.to(model.device)
pixel_values = processor.image_processor(images=image, return_tensors="pt")["pixel_values"]
pixel_values = pixel_values.to(model.device, dtype=torch.float16)

# Generate response
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        pixel_values=pixel_values,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=1024,
        eos_token_id=processor.tokenizer.eos_token_id,
        repetition_penalty=1.05
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
print(f"User: {question}
Assistant: {response}")
```

For more detailed usage (inference, evaluation, and Gradio demo), please refer to the [InternNav repository](https://github.com/InternRobotics/InternNav).

---

## Citation
If you find our work helpful, please consider starring this repository 🌟 and citing:

```bibtex
@misc{internvla-n1,
    title = {{InternVLA-N1: An} Open Dual-System Navigation Foundation Model with Learned Latent Plans},
    author = {InternVLA-N1 Team},
    year = {2025},
    booktitle={arXiv},
}
@misc{internnav2025,
    title = {{InternNav: InternRobotics'} open platform for building generalized navigation foundation models},
    author = {InternNav Contributors},
    howpublished={\url{https://github.com/InternRobotics/InternNav}},
    year = {2025}
}
@misc{wei2025groundslowfastdualsystem,
      title={Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation}, 
      author={Meng Wei and Chenyang Wan and Jiaqi Peng and Xiqian Yu and Yuqiang Yang and Delin Feng and Wenzhe Cai and Chenming Zhu and Tai Wang and Jiangmiao Pang and Xihui Liu},
      year={2025},
      eprint={2512.08186},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.08186}, 
}
```