---
license: mit
---

# Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

This repository contains the model and code for the paper [Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis](https://huggingface.co/papers/2507.23785).

This work presents a framework for video-to-4D generation that creates high-quality dynamic 3D content from a single input video. It introduces a *Direct 4DMesh-to-GS Variation Field VAE* that encodes canonical Gaussian Splats (GS) and their temporal variations into a compact latent space. Building on this representation, a *Gaussian Variation Field diffusion model* is trained with a temporal-aware Diffusion Transformer, conditioned on input videos and canonical GS. Trained on curated animatable 3D objects from the Objaverse dataset, the model achieves higher generation quality than existing methods and generalizes to in-the-wild video inputs despite being trained exclusively on synthetic data.

Project Page: [https://gvfdiffusion.github.io/](https://gvfdiffusion.github.io/)

Code: [https://github.com/ForeverFancy/GVFDiffusion](https://github.com/ForeverFancy/GVFDiffusion)

## Abstract
We present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content.
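
To make the idea of "canonical GS plus temporal variations" more concrete, here is a minimal NumPy sketch. It is not the repository's implementation. It assumes that per-frame Gaussian attributes are obtained by adding a per-frame offset to the canonical attributes, and the function name `animate_gaussians` and the attribute names `xyz` and `opacity` are hypothetical. In the actual model the variations are decoded from a learned latent space, and attributes such as rotations may need a different composition rule.

```python
import numpy as np


def animate_gaussians(canonical, variations):
    """Compose per-frame Gaussian attributes from a canonical set plus offsets.

    canonical:  dict mapping attribute name -> array of shape (N, D)
    variations: dict mapping attribute name -> array of shape (T, N, D)
    Returns a dict mapping attribute name -> array of shape (T, N, D).

    Assumption: composition is a plain addition; this is illustrative only.
    """
    animated = {}
    for name, base in canonical.items():
        delta = variations[name]
        if delta.shape[1:] != base.shape:
            raise ValueError(f"shape mismatch for attribute {name!r}")
        # Broadcast the canonical attributes over the T frames and add the offsets.
        animated[name] = base[None, :, :] + delta
    return animated


# Toy data: 4 Gaussians observed over 3 frames (all names are hypothetical).
num_frames, num_gaussians = 3, 4
canonical = {
    "xyz": np.zeros((num_gaussians, 3)),
    "opacity": np.full((num_gaussians, 1), 0.8),
}
variations = {
    "xyz": np.zeros((num_frames, num_gaussians, 3)),
    "opacity": np.zeros((num_frames, num_gaussians, 1)),
}
# Move every Gaussian 0.1 units along x per frame; leave opacity unchanged.
variations["xyz"][:, :, 0] = 0.1 * np.arange(num_frames)[:, None]

animated = animate_gaussians(canonical, variations)
print(animated["xyz"].shape)  # (3, 4, 3)
```

Storing one canonical set and a field of offsets, rather than a full set of Gaussians per frame, keeps the static shape and appearance separate from the motion, which is the property the variation field representation relies on.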

## Installation and Quick Start

For detailed installation instructions and a minimal inference example, please refer to the [GitHub repository](https://github.com/ForeverFancy/GVFDiffusion).

```bash
# Clone the repository
git clone https://github.com/ForeverFancy/GVFDiffusion.git
cd GVFDiffusion

# Set up the environment and dependencies
. ./setup.sh --new-env --basic --xformers --flash-attn --diffoctreerast --spconv --mipgaussian --kaolin --nvdiffrast

# Run a minimal inference example
accelerate launch --num_processes 1 inference_dpm_latent.py \
    --batch_size 1 \
    --exp_name /path/to/your/output \
    --config configs/diffusion.yml \
    --start_idx 0 \
    --end_idx 2 \
    --txt_file ./assets/in_the_wild.txt \
    --use_fp16 \
    --num_samples 2 \
    --adaptive \
    --data_dir ./assets/ \
    --num_timesteps 32 \
    --download_assets \
    --in_the_wild
```

## Citation
If you find the work useful, please consider citing:
```bibtex
@article{zhang2025gaussian,
  title={Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis},
  author={Zhang, Bowen and Xu, Sicheng and Wang, Chuxin and Yang, Jiaolong and Zhao, Feng and Chen, Dong and Guo, Baining},
  journal={arXiv preprint arXiv:2507.23785},
  year={2025}
}
```