---
datasets:
- LongVideo-Reason/longvideo-reason
language:
- en
tags:
- VLM
- RL
- video
- Reasoning
- Long
---
# LongVILA-R1-7B
[Paper](https://arxiv.org/abs/2507.07966)
[Code](https://github.com/NVlabs/Long-RL)
[Model](https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B)
[Video](https://www.youtube.com/watch?v=ykbblK2jiEg)
[Demo](https://long-rl.hanlab.ai)
## Introduction:
<p>
<strong>LongVILA-R1-7B</strong> supports both <u>multiple-choice</u> questions and <u>open-ended</u> questions. It can switch between thinking and non-thinking modes.<br>
<strong>LongVILA-R1-7B</strong> demonstrates strong performance in long video reasoning, achieving <strong>71.1%</strong> on VideoMME (w/ sub.) and surpassing Gemini-1.5-Pro across diverse reasoning tasks.<br>
<strong>LongVILA-R1-7B</strong> supports processing up to <strong>8,192</strong> video frames per video, with configurable FPS settings.<br>
<strong>Long-RL</strong> is a codebase that accelerates long video RL training by up to <strong>2.1×</strong> through its MR-SP system. It supports RL training on image, video, and omni inputs across VILA, Qwen/Qwen-VL, and diffusion models.
</p>
## Evaluation:
### Video QA Benchmarks
| Models | VideoMME (w/o sub) | VideoMME (w sub) | ActivityNet-QA (test) | LongVideoBench (val) | PerceptionTest (val) | NExT-QA (mc) | VNBench (val) |
|:-------------------|:------------------:|:----------------:|:---------------------:|:--------------------:|:--------------------:|:--------:|:-------------:|
| **LongVILA-7B** | **60.1** | **65.1** | **59.5** | **57.1** | **58.1** | **80.7** | **63.0** |
| **LongVILA-R1-7B** | **65.0** | **70.7** | **64.8** | **58.0** | **68.9** | **81.5** | **75.5** |
### LongVideo-Reason-eval
| Models | Temporal | Goal | Plot | Spatial | Overall |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **LongVILA-R1-7B** | **68.1** | **85.7** | **70.6** | **53.3** | **72.0** |
## Usage
### General Inference
```python
from transformers import AutoModel
model_path = "Efficient-Large-Model/LongVILA-R1-7B"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")
# You can adjust the FPS value as needed.
# To disable FPS control, set it to 0 and manually specify the number of processed video frames via `num_video_frames`.
# Example:
# model.config.fps = 8.0
# model.config.num_video_frames, model.config.fps = 512, 0
use_thinking = True # Switching between thinking and non-thinking modes
system_prompt_thinking = "You are a helpful assistant. The user asks a question, and then you solves it.\n\nPlease first think deeply about the question based on the given video, and then provide the final answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.\n\n Question: {question}"
prompt = "What is the main purpose of the video?"
video_path = "video.mp4"
if use_thinking:
    prompt = system_prompt_thinking.format(question=prompt)
response = model.generate_content([prompt, {"path": video_path}])
print("Response: ", response)
```
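In thinking mode, the response wraps the reasoning and the final answer in `<think> </think>` and `<answer> </answer>` tags, as specified by the system prompt above. The snippet below is an optional post-processing sketch (not part of the model's API) that splits the two parts; the helper name is ours, and it falls back to the raw text when the tags are absent (e.g., in non-thinking mode).
```python
import re

def split_thinking_response(response: str):
    """Split a thinking-mode response into (reasoning, answer)."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final_answer = answer.group(1).strip() if answer else response.strip()
    return reasoning, final_answer

reasoning, final_answer = split_thinking_response(response)
print("Reasoning:", reasoning)
print("Answer:", final_answer)
```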
### with vLLM engine
Tested with `vllm==0.9.1`. First, copy the model's remote code files into a local `remote_code` directory:
```bash
mkdir remote_code
cp path_to/Efficient-Large-Model/LongVILA-R1-7B/*.py remote_code
```
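If you do not yet have a local checkout of the checkpoint, one way (an assumption about your setup, not a requirement) is to download it with `huggingface_hub` and use the returned directory wherever `path_to/Efficient-Large-Model/LongVILA-R1-7B` appears below:
```python
from huggingface_hub import snapshot_download

# Downloads the repository and returns the local checkpoint directory.
local_path = snapshot_download("Efficient-Large-Model/LongVILA-R1-7B")
print(local_path)
```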
Then, you can use the following code for model generation.
```python
import os
from transformers import AutoModel
from vllm import LLM, SamplingParams
from remote_code.media import extract_media
from remote_code.mm_utils import process_images
from remote_code.tokenizer_utils import tokenize_conversation
model_path = "path_to/Efficient-Large-Model/LongVILA-R1-7B"
model_encoder = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto", llm_only_need_embed=True)
# you can change gpu_memory_utilization according to GPU memory
llm = LLM(model=os.path.join(model_path, "llm"), enable_prompt_embeds=True, gpu_memory_utilization=0.5)
use_thinking = True # Switching between thinking and non-thinking modes
system_prompt_thinking = "You are a helpful assistant. The user asks a question, and then you solves it.\n\nPlease first think deeply about the question based on the given video, and then provide the final answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.\n\n Question: {question}"
prompt = "What is the main purpose of the video?"
video_path = "video.mp4"
if use_thinking:
    prompt = system_prompt_thinking.format(question=prompt)
conversation = [{"from": "human", "value": [prompt, {"path": video_path}]}]
media = extract_media(conversation, model_encoder.config)
input_ids = tokenize_conversation(conversation, model_encoder.tokenizer, add_generation_prompt=True).unsqueeze(0).cuda()
media["video"] = [
process_images(images, model_encoder.vision_tower.image_processor, model_encoder.config).half()
for images in media["video"]
]
inputs_embeds, _, _ = model_encoder._embed(input_ids, media, {"video": {}}, None, None)
completions = llm.generate(prompts=[{"prompt_embeds": inputs_embeds.squeeze(0)}], sampling_params=SamplingParams(max_tokens=1024))
response = completions[0].outputs[0].text
print("Response: ", response)
```
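Because the video is encoded once into `inputs_embeds`, the same cached embeddings can be reused for several sampled rollouts, which is the idea behind MR-SP's cached-embedding rollout. A minimal sketch follows; the sampling values are illustrative defaults, not the settings used in the paper.
```python
# Sample several candidate answers from the same cached prompt embeddings.
sampling_params = SamplingParams(
    temperature=0.7,   # illustrative, tune as needed
    top_p=0.95,
    max_tokens=1024,
    n=4,               # number of rollouts per prompt
)
completions = llm.generate(
    prompts=[{"prompt_embeds": inputs_embeds.squeeze(0)}],
    sampling_params=sampling_params,
)
for i, output in enumerate(completions[0].outputs):
    print(f"Rollout {i}:", output.text)
```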
# LongVILA-R1 Model Card
## Model details
**Model type:**
LongVILA-R1 addresses the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.0% and 70.7% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1 shows steady performance improvements as the number of input video frames increases.
**Model date:**
LongVILA-R1-7B was trained in July 2025.
**Paper or resources for more information:**
- Paper https://arxiv.org/abs/2507.07966
- Code https://github.com/NVlabs/Long-RL
- Model https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B
- Video https://www.youtube.com/watch?v=ykbblK2jiEg
- Demo https://long-rl.hanlab.ai
```bibtex
@misc{long-rl,
title = {Long-RL: Scaling RL to Long Sequences},
author = {Yukang Chen and Wei Huang and Shuai Yang and Qinghao Hu and Baifeng Shi and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/NVlabs/Long-RL}},
}
```
```bibtex
@article{chen2025longvila-r1,
title={Scaling RL to Long Videos},
author={Yukang Chen and Wei Huang and Baifeng Shi and Qinghao Hu and Hanrong Ye and Ligeng Zhu and Zhijian Liu and Pavlo Molchanov and Jan Kautz and Xiaojuan Qi and Sifei Liu and Hongxu Yin and Yao Lu and Song Han},
year={2025},
eprint={2507.07966},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
```bibtex
@inproceedings{chen2024longvila,
title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
author={Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Ethan He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
booktitle={The International Conference on Learning Representations (ICLR)},
year={2025},
}
```
## License
- The weights are released under the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
- [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI
  - [Dataset Licenses](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/LICENSE) for each dataset used during training.
- [NVIDIA Licenses](https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B/blob/main/NV_LICENSE)
**Where to send questions or comments about the model:**
https://github.com/NVlabs/Long-RL/issues
## Intended use
**Primary intended uses:**
The primary use of LongVILA-R1 is research on large multimodal models and chatbots.
**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
## Input:
**Input Type:** Video and Text
**Input Format:** MP4 and other video formats
## Output:
**Output Type:** Text
**Output Format:** String
**Supported Operating System(s):** <br>
Linux
## Inference:
**Engine:**
* PyTorch
**Test Hardware:**
* A100
* H100
* A6000
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. |