---
license: mit
tags:
- pytorch
---
# 🔥 FAR: Frame Autoregressive Model for Both Short- and Long-Context Video Modeling
<div align="center">
[Project Page](https://farlongctx.github.io/)
[arXiv Paper](https://arxiv.org/abs/2503.19325)
[Hugging Face Models](https://huggingface.co/guyuchao/FAR_Models)
[Papers with Code: Video Generation on UCF-101](https://paperswithcode.com/sota/video-generation-on-ucf-101)
</div>
<p align="center" style="font-size: larger;">
<a href="https://arxiv.org/abs/2503.19325">Long-Context Autoregressive Video Modeling with Next-Frame Prediction</a>
</p>

## 📢 News
* **2025-03:** Paper and Code of [FAR](https://farlongctx.github.io/) are released! 🎉
## 🌟 What's the Potential of FAR?
### 🔥 Introducing FAR: a new baseline for autoregressive video generation
FAR (i.e., <u>**F**</u>rame <u>**A**</u>uto<u>**R**</u>egressive Model) learns to predict continuous frames based on an autoregressive context. Its objective aligns naturally with video modeling, analogous to next-token prediction in language modeling.
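To make the objective concrete, below is a deliberately simplified, illustrative sketch (not the released FAR code) of frame-level autoregressive training on continuous latents: each target frame is corrupted with noise and denoised conditioned on its clean preceding frames. The tiny model, latent shapes, and noise schedule are placeholders for illustration only.

```python
# Illustrative toy sketch of frame-autoregressive training on continuous latents.
# This is NOT the official FAR implementation; model, shapes, and schedule are placeholders.
import torch
import torch.nn as nn

class TinyFrameAR(nn.Module):
    """Stand-in for a frame-autoregressive model: given a noisy target frame,
    a summary of the clean context frames, and a noise level, predict the added noise."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.GELU(), nn.Linear(256, dim))

    def forward(self, noisy_frame, context, t):
        return self.net(torch.cat([noisy_frame, context, t], dim=-1))

def training_step(model, video_latents):
    """video_latents: (B, T, D) continuous frame latents, e.g. from a pretrained VAE."""
    B, T, D = video_latents.shape
    loss = 0.0
    for idx in range(1, T):
        target = video_latents[:, idx]                # frame to denoise
        context = video_latents[:, :idx].mean(dim=1)  # clean autoregressive context (a real model attends over it)
        t = torch.rand(B, 1)                          # noise level in [0, 1)
        noise = torch.randn_like(target)
        noisy = (1.0 - t) * target + t * noise        # simple interpolation-style corruption
        pred = model(noisy, context, t)
        loss = loss + ((pred - noise) ** 2).mean()    # per-frame denoising loss
    return loss / (T - 1)

model = TinyFrameAR()
training_step(model, torch.randn(2, 8, 64)).backward()
```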

### 🔥 FAR achieves better convergence than video diffusion models with the same continuous latent space
<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/converenge.jpg?raw=true" width=55%>
</p>
### 🔥 FAR leverages clean visual context without additional image-to-video fine-tuning:
Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frame = 0) and video prediction (context frame ≥ 1) within a single model.
<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/performance.png?raw=true" width=75%>
</p>
### 🔥 FAR supports 16x longer temporal extrapolation at test time
<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/extrapolation.png?raw=true" width=100%>
</p>
### 🔥 FAR supports efficient training on long-video sequences with manageable token lengths
<p align="center">
<img src="https://github.com/showlab/FAR/blob/main/assets/long_short_term_ctx.jpg?raw=true" width=55%>
</p>
#### 👉 For more details, check out our [paper](https://arxiv.org/abs/2503.19325).
## 🏋️ FAR Model Zoo
We provide the trained FAR models from our paper for reproduction.
### Video Generation
We evaluate with seeds [0, 2, 4, 6], following the evaluation protocol of [Latte](https://arxiv.org/abs/2401.03048):
| Model (Config) | #Params | Resolution | Condition | FVD | HF Weights | Pre-Computed Samples |
|:-------:|:------------:|:------------:|:-----------:|:-----:|:----------:|:----------:|
| [FAR-L](options/train/far/video_generation/FAR_L_ucf101_uncond_res128_400K_bs32.yml) | 457 M | 128x128 | ✗ | 280 ± 11.7 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_L_UCF101_Uncond128-c19abd2c.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-L](options/train/far/video_generation/FAR_L_ucf101_cond_res128_400K_bs32.yml) | 457 M | 128x128 | ✓ | 99 ± 5.9 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_L_UCF101_Cond128-c6f798bf.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-L](options/train/far/video_generation/FAR_L_ucf101_uncond_res256_400K_bs32.yml) | 457 M | 256x256 | ✗ | 303 ± 13.5 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_L_UCF101_Uncond256-adea51e9.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-L](options/train/far/video_generation/FAR_L_ucf101_cond_res256_400K_bs32.yml) | 457 M | 256x256 | ✓ | 113 ± 3.6 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_L_UCF101_Cond256-41c6033f.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-XL](options/train/far/video_generation/FAR_XL_ucf101_uncond_res256_400K_bs32.yml) | 657 M | 256x256 | ✗ | 279 ± 9.2 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_XL_UCF101_Uncond256-3594ce6b.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-XL](options/train/far/video_generation/FAR_XL_ucf101_cond_res256_400K_bs32.yml) | 657 M | 256x256 | ✓ | 108 ± 4.2 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_XL_UCF101_Cond256-28a88f56.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
### Short-Video Prediction
We follow the evaluation protocol of [MCVD](https://arxiv.org/abs/2205.09853) and [ExtDM](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhang_ExtDM_Distribution_Extrapolation_Diffusion_Model_for_Video_Prediction_CVPR_2024_paper.pdf):
| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples |
|:-----:|:------------:|:------------:|:-----:|:-----:|:-----:|:-----:|:----------:|:----------:|
| [FAR-B](options/train/far/short_video_prediction/FAR_B_ucf101_res64_200K_bs32.yml) | 130 M | UCF101 | 25.64 | 0.818 | 0.037 | 194.1 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/short_video_prediction/FAR_B_UCF101_Uncond64-381d295f.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-B](options/train/far/short_video_prediction/FAR_B_bair_res64_200K_bs32.yml) | 130 M | BAIR (c=2, p=28) | 19.40 | 0.819 | 0.049 | 144.3 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/short_video_prediction/FAR_B_BAIR_Uncond64-1983191b.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
### Long-Video Prediction
We evaluate with seeds [0, 2, 4, 6], following the evaluation protocol of [TECO](https://arxiv.org/abs/2210.02396):
| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples |
|:-----:|:------------:|:------------:|:-----:|:-----:|:-----:|:-----:|:----------:|:----------:|
| [FAR-B-Long](options/train/far/long_video_prediction/FAR_B_Long_dmlab_res64_400K_bs32.yml) | 150 M | DMLab | 22.3 | 0.687 | 0.104 | 64 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/long_video_prediction/FAR_B_Long_DMLab_Action64-c09441dc.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-M-Long](options/train/far/long_video_prediction/FAR_M_Long_minecraft_res128_400K_bs32.yml) | 280 M | Minecraft | 16.9 | 0.448 | 0.251 | 39 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/long_video_prediction/FAR_M_Long_Minecraft_Action128-4c041561.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
## 🔧 Dependencies and Installation
### 1. Setup Environment:
```bash
# Setup Conda Environment
conda create -n FAR python=3.10
conda activate FAR
# Install PyTorch
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia
# Install other dependencies
pip install -r requirements.txt
```
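After installation, an optional sanity check confirms the pinned PyTorch build and CUDA toolkit are visible; the expected values follow the versions installed above.

```python
import torch

print(torch.__version__)          # expect 2.5.0
print(torch.version.cuda)         # expect 12.4
print(torch.cuda.is_available())  # should be True on a machine with a CUDA-capable GPU
```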
### 2. Prepare Dataset:
We have uploaded the datasets used in this paper to Hugging Face for faster download. Please follow the instructions below to prepare them.
```python
from huggingface_hub import snapshot_download

dataset_url = {
    "ucf101": "guyuchao/UCF101",
    "bair": "guyuchao/BAIR",
    "minecraft": "guyuchao/Minecraft",
    "minecraft_latent": "guyuchao/Minecraft_Latent",
    "dmlab": "guyuchao/DMLab",
    "dmlab_latent": "guyuchao/DMLab_Latent"
}

for key, url in dataset_url.items():
    snapshot_download(
        repo_id=url,
        repo_type="dataset",
        local_dir=f"datasets/{key}",
        token="input your hf token here"
    )
```
Then, enter each dataset directory and extract the shards:
```bash
find . -name "shard-*.tar" -exec tar -xvf {} \;
```
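Alternatively, a small convenience script (assuming the `datasets/<name>/` layout created by the download snippet above) extracts every shard in one pass instead of entering each directory manually:

```python
# Optional helper: extract all downloaded shards in one pass.
# Assumes the datasets/<name>/ layout created by the download script above.
import tarfile
from pathlib import Path

for shard in Path("datasets").rglob("shard-*.tar"):
    with tarfile.open(shard) as tar:
        tar.extractall(path=shard.parent)  # unpack next to the shard, like `tar -xvf`
    print(f"extracted {shard}")
```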
### 3. Prepare Pretrained Models of FAR:
We have uploaded the pretrained FAR models to Hugging Face. Please follow the instructions below to download them if you want to evaluate FAR.
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"
)
```
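As an optional check, you can verify that a downloaded checkpoint loads as a standard PyTorch file. The path below is only an example following the `local_dir` layout above and the repository's subfolders; adjust it to the model you downloaded.

```python
import torch

# Example path following the local_dir layout above; adjust to the model you downloaded.
ckpt = "experiments/pretrained_models/FAR_Models/video_generation/FAR_L_UCF101_Uncond128-c19abd2c.pth"
state = torch.load(ckpt, map_location="cpu")
print(type(state))
if isinstance(state, dict):
    print(list(state.keys())[:5])  # peek at the top-level checkpoint structure
```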
## 🚀 Training
To train different models, you can run the following command:
```bash
accelerate launch \
--num_processes 8 \
--num_machines 1 \
--main_process_port 19040 \
train.py \
-opt train_config.yml
```
* **Wandb:** Set ```use_wandb``` to ```True``` in the config to enable wandb monitoring.
* **Periodic Evaluation:** Set ```val_freq``` to control how often evaluation runs during training.
* **Auto Resume:** Simply rerun the script; the model will resume from the latest checkpoint, and the wandb log will resume automatically.
* **Efficient Training on Pre-Extracted Latents:** Set ```use_latent``` to ```True``` and set ```data_list``` to the corresponding latent path list (see the example config fragment below).
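The exact schema is defined by the YAML files under ```options/train/far/```; the fragment below is only a hypothetical illustration of the options mentioned above, with example values.

```yaml
# Hypothetical config fragment; consult the provided options/train/far/*.yml files for the real schema.
use_wandb: true                                    # enable wandb monitoring
val_freq: 5000                                     # evaluate every 5000 iterations (example value)
use_latent: true                                   # train on pre-extracted latents
data_list: datasets/dmlab_latent/train_list.txt    # example latent path list; adjust to your data
```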
## 💻 Sampling & Evaluation
To evaluate a pretrained model, copy the corresponding training config and point ```pretrain_network``` (which defaults to ```~```) to the path of your trained model. Then run the following script:
```bash
accelerate launch \
--num_processes 8 \
--num_machines 1 \
--main_process_port 10410 \
test.py \
-opt test_config.yml
```
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 📖 Citation
If our work assists your research, feel free to give us a star ⭐ or cite us using:
```
@article{gu2025long,
title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2503.19325},
year={2025}
}
```