Overview
We introduce Dress&Dance, a general framework for video generation conditioned on a wide range of inputs, including text, layout, sparse pose sequences, and multimodal combinations of these. Instead of building task-specific architectures, we employ a unified conditioning mechanism, which we call CondNet, that encodes the different modalities into a shared representation space. This allows a single video generator to support diverse control signals without retraining. Our approach further employs a hybrid training paradigm to balance realism and controllability.
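The core idea above — per-modality encoders feeding one shared conditioning space — can be illustrated with a toy sketch. Everything here is hypothetical (the dimensions, the `ModalityEncoder` class, and the linear projections are illustrative stand-ins, not the actual CondNet architecture); it only shows how heterogeneous inputs can be mapped into a single token sequence that one generator consumes.

```python
import numpy as np

D_SHARED = 8  # shared conditioning dimension (illustrative choice)
rng = np.random.default_rng(0)

class ModalityEncoder:
    """Toy linear encoder: projects modality-specific features into the shared space."""
    def __init__(self, in_dim, out_dim=D_SHARED):
        # Random projection stands in for a learned encoder.
        self.W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def __call__(self, x):
        # x: (num_tokens, in_dim) -> (num_tokens, D_SHARED)
        return x @ self.W

# One lightweight encoder per supported modality (dimensions are made up).
encoders = {
    "text":   ModalityEncoder(in_dim=16),
    "pose":   ModalityEncoder(in_dim=4),
    "layout": ModalityEncoder(in_dim=6),
}

def condnet(inputs):
    """Map any subset of modalities into one shared conditioning token sequence."""
    tokens = [encoders[name](feats) for name, feats in inputs.items()]
    return np.concatenate(tokens, axis=0)  # (total_tokens, D_SHARED)

# Example: condition on text plus a sparse pose sequence only.
cond = condnet({
    "text": rng.standard_normal((3, 16)),  # 3 text tokens
    "pose": rng.standard_normal((5, 4)),   # 5 sparse pose keyframe tokens
})
print(cond.shape)  # -> (8, 8)
```

Because every modality lands in the same space, the downstream generator never needs to know which subset of signals was provided — it simply attends over whatever conditioning tokens arrive, which is what makes the single-model, no-retraining claim plausible.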
Across a range of examples, including those shown in the accompanying videos, our method generates temporally coherent and photorealistic results, capturing subtle dynamics while maintaining faithful alignment with the input prompts. The same model generalizes across tasks, demonstrating strong compositionality and responsiveness to diverse conditioning inputs.
Result Gallery
Citation
Acknowledgements
We thank you for visiting our project page.