StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

Supplementary Material

 

 


Comparison to Baselines(Single Reference)

We present the comparison results of our method with other single reference style-guided Text-to-Video methods, including:

Style Reference VideoComposer([1]) VideoCrafter*([2]) Gen2*([4]) Ours
"A chef preparing meals in kitchen."
"A wolf walking stealthily through the forest."
"A field of sunflowers on a sunny day."
"A rocketship heading towards the moon."
"A bear catching fish in a river."
"A knight riding a horse through a field."
"A river flowing gently under a bridge."
"A street performer playing the guitar."

 


Comparison to Baselines(Multiple Reference)

We present the comparison results of our method with AnimateDiff([5]) for multi-reference style-guided Text-to-Video methods.

Our method effectively generate high-quality stylized video that align with prompts and confrom the style of reference images without any additional finetuning costs

 

Style Reference AnimateDiff([5]) Ours(S-R) Ours(M-R)
"A wooden sailboat docked in a harbor."
"A student walking to school with backpack."
"A street performer playing the guitar."
"A wolf walking stealthily through the forest."
"A student walking to school with backpack."
"A knight riding a horse through a field."
"A chef preparing meals in kitchen."
"A rocketship heading towards the moon."

 


Ablation Study

 

we ablate the two-stage training strategy to verify its effectiveness in stylized video generation.
Style Reference Only Stage1 Only Joint Training Ours
"A student walking to school with backpack."

 


StyleCrafter Combined with Depth Control

We present sample results of our method combined with depth control, and compare with VideoComposer

 

Style Reference Input Depth VideoComposer([1]) Ours
"A tiger walks in the forest."
"A car turning around on a countryside road."

 


Additional Results of StyleCrafter

We present additional style-guided text-to-video generation results of our method.
"A bear catching fish in a river."
hidden
"A wodden sailboat docked in a harbor"
"A chef preparing meals in kitchen"
"A campfire surrounded by tents"


 

 

 


References

[1] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. arXiv preprint:2306.02018, 2023

[2] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation. preprint arXiv:2310.19512, 2023

[3] OpenAI. Gpt-4v(ision) system card. Technical report, 2023.

[4] Gen-2 contributors. Gen-2. Gen-2. Accessed Nov. 1, 2023 [Online] https://research.runwayml.com/gen2.

[5] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint:2307.04725, 2023