V²Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes

Abstract

This paper introduces V²Edit, a novel training-free framework for instruction-guided video and 3D scene editing. To balance preservation of the original content against fulfillment of the editing instruction, our approach uses a progressive strategy that decomposes a complex editing task into a sequence of simpler subtasks. Each subtask is controlled through three synergistic mechanisms: the initial noise, the noise added at each denoising step, and the cross-attention maps between text prompts and video content. Together, these mechanisms robustly preserve the original video elements while still applying the desired edits. Beyond its native video editing capability, we extend V²Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits even for tasks involving substantial geometric changes such as object insertion. Extensive experiments demonstrate that V²Edit produces high-quality, successful edits across challenging video editing tasks and complex 3D scene editing tasks, establishing state-of-the-art performance in both domains.
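The control flow of the progressive strategy can be illustrated with a small toy sketch. It is not the V²Edit implementation: the names `SubtaskControls`, `progressive_edit`, and `toy_edit` are hypothetical, the "video" is a plain list of numbers, and the editor is a stand-in that blends values toward a fixed target instead of running a diffusion model. It only shows how one large edit can be applied as a chain of smaller edits, each run under the same set of control settings.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Video = List[float]  # toy stand-in for video frames (assumption, not the real data type)


@dataclass(frozen=True)
class SubtaskControls:
    """Hypothetical knobs standing in for the three mechanisms named in the abstract."""
    initial_noise_strength: float   # how far the starting point departs from the input
    per_step_noise_scale: float     # noise added at each denoising step (unused by the toy editor)
    reuse_cross_attention: bool     # whether attention maps would be carried over (unused here)


Editor = Callable[[Video, str, SubtaskControls], Video]


def progressive_edit(video: Video, subtask_prompts: Sequence[str],
                     controls: SubtaskControls, edit_fn: Editor) -> List[Video]:
    """Apply one simple edit per subtask, feeding each result into the next subtask."""
    history: List[Video] = []
    current = video
    for prompt in subtask_prompts:
        current = edit_fn(current, prompt, controls)
        history.append(current)
    return history


def toy_edit(video: Video, prompt: str, controls: SubtaskControls) -> Video:
    """Toy editor: ignores the prompt and blends every value toward 1.0.

    A smaller initial_noise_strength keeps more of the input, which mimics
    the preservation-versus-editing trade-off in a very loose way.
    """
    s = controls.initial_noise_strength
    return [(1.0 - s) * x + s * 1.0 for x in video]


if __name__ == "__main__":
    controls = SubtaskControls(initial_noise_strength=0.5,
                               per_step_noise_scale=0.1,
                               reuse_cross_attention=True)
    steps = progressive_edit([0.0, 0.0], ["subtask 1", "subtask 2"], controls, toy_edit)
    for i, result in enumerate(steps, start=1):
        print(f"after subtask {i}: {result}")
```

In this toy setting each subtask moves the content only part of the way toward the target, so every intermediate result stays close to its input. A real system would replace `toy_edit` with a call to a video diffusion editor.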

Gallery: Video Editing

Gallery: 3D Scene Editing

Citation

@misc{V2Edit,
    title={{V2Edit}: Versatile Video Diffusion Editor for Videos and {3D} Scenes}, 
    author={Yanming Zhang and Jun-Kun Chen and Jipeng Lyu and Yu-Xiong Wang},
    year={2025},
    eprint={2503.10634},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2503.10634}, 
}

Acknowledgements

The website template is borrowed from ProEdit.
We thank you and the other visitors for visiting our project page.
