We present comparison results of our method against other single-reference style-guided text-to-video methods, including:
Style Reference | VideoComposer ([1]) | VideoCrafter* ([2]) | Gen-2* ([4]) | Ours
---|---|---|---|---
"A chef preparing meals in kitchen." | | | |
"A wolf walking stealthily through the forest." | | | |
"A field of sunflowers on a sunny day." | | | |
"A rocketship heading towards the moon." | | | |
"A bear catching fish in a river." | | | |
"A knight riding a horse through a field." | | | |
"A river flowing gently under a bridge." | | | |
"A street performer playing the guitar." | | | |
We present the comparison results of our method with AnimateDiff ([5]) for multi-reference style-guided text-to-video generation.
Our method effectively generates high-quality stylized videos that align with the prompts and conform to the style of the reference images, without any additional fine-tuning cost.
Style Reference | AnimateDiff ([5]) | Ours (S-R, single reference) | Ours (M-R, multiple references)
---|---|---|---
"A wooden sailboat docked in a harbor." | | |
"A student walking to school with backpack." | | |
"A street performer playing the guitar." | | |
"A wolf walking stealthily through the forest." | | |
"A student walking to school with backpack." | | |
"A knight riding a horse through a field." | | |
"A chef preparing meals in kitchen." | | |
"A rocketship heading towards the moon." | | |
We ablate the two-stage training strategy to verify its effectiveness in stylized video generation.
Style Reference | Stage 1 Only | Joint Training Only | Ours
---|---|---|---
"A student walking to school with backpack." | | |
We present sample results of our method combined with depth control, and compare them with VideoComposer ([1]).
Style Reference | Input Depth | VideoComposer ([1]) | Ours
---|---|---|---
"A tiger walks in the forest." | | |
"A car turning around on a countryside road." | | |
We present additional style-guided text-to-video generation results of our method.
"A bear catching fish in a river."

"A wooden sailboat docked in a harbor."

"A chef preparing meals in kitchen."

"A campfire surrounded by tents."
[1] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. VideoComposer: Compositional video synthesis with motion controllability. arXiv preprint arXiv:2306.02018, 2023.
[2] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
[3] OpenAI. GPT-4V(ision) system card. Technical report, 2023.
[4] Gen-2 contributors. Gen-2. Accessed Nov. 1, 2023 [Online]: https://research.runwayml.com/gen2.
[5] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.