We used 12 text-to-video models to generate a total of 108k videos from 16k unique prompts, ultimately yielding 182k annotated triplets, each consisting of a prompt paired with two videos and the corresponding preference annotations.
Annotators are asked to perform pairwise assessments across three critical dimensions: Visual Quality (VQ), Motion Quality (MQ), and Text Alignment (TA). For each dimension, annotators are presented with a prompt and two generated videos and asked to indicate their preference (A wins / Ties / B wins).
In addition to pairwise annotations, we also conduct pointwise annotations using a similar procedure, where annotators rate each video on a Likert scale from 1 to 5 for the same three dimensions (i.e., VQ, MQ, TA). This dual annotation setup enables us to explore the advantages and limitations of pairwise preferences versus pointwise scores as annotation strategies.
Score Regression vs. Bradley-Terry. Given that our training dataset includes both pointwise scores and pairwise preferences, we investigate two types of reward models: Score Regression and Bradley-Terry (BT). We find that the BT model consistently outperforms the regression model across various dataset sizes, although the performance gap between the two narrows as the dataset size increases.
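For concreteness, here is a minimal sketch of the two training objectives in PyTorch-style code, assuming a reward model that outputs one scalar per video and per dimension (function and variable names are illustrative, not from our released code):

```python
import torch.nn.functional as F

def regression_loss(pred_score, human_score):
    """Score Regression: fit the pointwise Likert score (1-5) with MSE."""
    return F.mse_loss(pred_score, human_score)

def bt_loss(reward_a, reward_b, pref_a):
    """Bradley-Terry: maximize the likelihood of the pairwise preference.
    pref_a is 1.0 if video A is preferred and 0.0 if video B is preferred,
    so the loss is -log sigmoid(r_A - r_B) for A-wins pairs."""
    return F.binary_cross_entropy_with_logits(reward_a - reward_b, pref_a)
```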
Ties matter. The importance of tie annotations is often overlooked in BT models. We find that including tie annotations significantly improves the robustness of the reward model, so we adopt the Bradley-Terry model with ties (BTT) instead of the traditional BT model to capture human preferences.
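For reference, one standard way to extend BT with ties (in the Rao-Kupper style) introduces a tie threshold θ > 1 on top of the rewards; we show it here as an illustration of the idea rather than the exact parameterization used in our model:

$$ P(A \succ B) = \frac{e^{r_A}}{e^{r_A} + \theta\, e^{r_B}}, \qquad P(A \,\text{ties}\, B) = \frac{(\theta^2 - 1)\, e^{r_A + r_B}}{\bigl(e^{r_A} + \theta\, e^{r_B}\bigr)\bigl(e^{r_B} + \theta\, e^{r_A}\bigr)}, \quad \theta > 1. $$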
Separation for Decoupling. We observe that using a shared token for multi-dimensional rewarding tends to couple context-agnostic dimensions (e.g., VQ and MQ) with the input context. We therefore adopt a separate token for each dimension to ensure that context-agnostic dimensions are evaluated independently.
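A minimal sketch of this separate-token design, assuming a transformer backbone that returns per-token hidden states; the class, token, and head names below are illustrative rather than taken from our released code:

```python
import torch
import torch.nn as nn

class MultiDimRewardHead(nn.Module):
    """Decode one reward per dimension from its own dedicated special token,
    rather than sharing a single reward token across VQ / MQ / TA."""
    def __init__(self, hidden_size, dims=("VQ", "MQ", "TA")):
        super().__init__()
        self.dims = dims
        # one learnable query token per dimension, appended to the sequence
        self.reward_tokens = nn.Parameter(torch.randn(len(dims), hidden_size) * 0.02)
        self.heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in dims])

    def forward(self, backbone, input_embeds):
        # input_embeds: (B, L, H) prompt + video embeddings
        B = input_embeds.size(0)
        queries = self.reward_tokens.unsqueeze(0).expand(B, -1, -1)   # (B, 3, H)
        hidden = backbone(torch.cat([input_embeds, queries], dim=1))  # (B, L+3, H)
        token_states = hidden[:, -len(self.dims):]                    # states of the 3 reward tokens
        return {d: head(token_states[:, i]).squeeze(-1)
                for i, (d, head) in enumerate(zip(self.dims, self.heads))}
```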
VideoGen-RewardBench. Existing human-annotated evaluation benchmarks are limited to videos generated by pre-Sora-era T2V models. We manually construct 26.5k triplets and hire annotators to provide pairwise preference labels. Annotators are additionally asked to judge the overall quality between the two videos, which serves as a universal label for comparing reward models regardless of the specific dimensions they evaluate.
We visualize the model coverage across the training sets of different methods and the two evaluation benchmarks:
Evaluations across reward models. We employ two primary benchmarks: (1) VideoGen-RewardBench, a modern benchmark that evaluates reward models on the latest T2V models, and (2) VideoGen-Eval, a complementary benchmark that evaluates reward models on pre-Sora-era T2V models. We report two accuracy metrics, ties-included accuracy and ties-excluded accuracy, and compare results across different reward models.
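As a reference for how the two metrics are read, here is a minimal sketch under a common convention (the exact thresholding used to predict a tie follows the benchmark protocol; the function name is illustrative):

```python
import numpy as np

def pairwise_accuracy(pred, label, include_ties=True):
    """pred / label: arrays of per-pair judgments in {"A", "B", "tie"}.
    Ties-included: ties are kept as a third class over all pairs.
    Ties-excluded: tie-labeled pairs are dropped before scoring."""
    pred, label = np.asarray(pred), np.asarray(label)
    if not include_ties:
        keep = label != "tie"
        pred, label = pred[keep], label[keep]
    return float((pred == label).mean())
```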
Starting from the common objective of RLHF algorithms:
$$\max_{p_\theta} \mathbb{E}_{\mathbf{y} \sim \mathcal{D}_c,\, \mathbf{x}_0 \sim p_\theta(\mathbf{x}_0 \mid \mathbf{y})} \left[ r(\mathbf{x}_0,\mathbf{y}) \right] - \beta \, \mathbb{D}_{\text{KL}} \left[ p_\theta(\mathbf{x}_0 \mid \mathbf{y}) \,\|\, p_{\text{ref}}(\mathbf{x}_0 \mid \mathbf{y}) \right]$$

We generalize three alignment algorithms to flow-based models: two training-time strategies (Flow-DPO and Flow-RWR) and one inference-time technique (Reward Guidance).
Flow-DPO:
$$ \begin{align} \mathcal{L}_\text{DPO}(\theta) = - \mathbb{E} \Bigg[ \log \sigma \Bigg(& -\frac{\beta_t}{2} \Big(\| \mathbf{v}^w - \mathbf{v}_\theta(\mathbf{x}_{t}^w, t)\|^2 - \|\mathbf{v}^w - \mathbf{v}_\text{ref}(\mathbf{x}_{t}^w, t)\|^2 \notag \\ &\quad - \bigl(\|\mathbf{v}^l - \mathbf{v}_\theta(\mathbf{x}_{t}^l, t)\|^2 - \|\mathbf{v}^l - \mathbf{v}_\text{ref}(\mathbf{x}_{t}^l, t)\|^2 \bigr) \Big) \Bigg) \Bigg], \end{align} $$

Flow-RWR:
$$ \begin{align} \mathcal{L}_\text{RWR}(\theta) &= \mathbb{E} \bigl[\exp(r(\mathbf{x}_0, \mathbf{y}))\,\|\mathbf{v} - \mathbf{v}_\theta(\mathbf{x}_{t}, t, \mathbf{y})\|^2\bigr], \end{align} $$

Reward Guidance:
$$ \tilde{\mathbf{v}}_t(\mathbf{x}_t \mid \mathbf{y}) = \mathbf{v}_t(\mathbf{x}_t \mid \mathbf{y}) - w \,\frac{t}{1 - t}\,\nabla r(\mathbf{x}_t, \mathbf{y}), $$

We compare the three alignment algorithms against the baseline models and SFT, and find that Flow-DPO outperforms the other algorithms in our setting. Interestingly, Flow-DPO with a constant β outperforms the variant with a timestep-dependent β, even though the latter is the direct derivation from the Diffusion-DPO algorithm. A timestep-dependent β may cause uneven training across timesteps, since T2V models share model weights across noise levels.
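As a reference, here is a minimal sketch of the Flow-DPO loss above with a constant β; tensor and helper names are illustrative, and the mean over non-batch dimensions stands in for the squared norm up to a constant absorbed into β:

```python
import torch.nn.functional as F

def flow_dpo_loss(v_theta_w, v_ref_w, v_target_w,
                  v_theta_l, v_ref_l, v_target_l, beta):
    """Flow-DPO with a constant beta. Inputs are the predicted (policy / frozen
    reference) and target velocities for the preferred (w) and rejected (l)
    videos, noised to the same timestep t."""
    err = lambda pred, tgt: ((pred - tgt) ** 2).flatten(1).mean(dim=1)
    # per-sample improvement of the policy over the reference
    diff_w = err(v_theta_w, v_target_w) - err(v_ref_w, v_target_w)
    diff_l = err(v_theta_l, v_target_l) - err(v_ref_l, v_target_l)
    return -F.logsigmoid(-0.5 * beta * (diff_w - diff_l)).mean()
```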
We showcase some of the results after alignment with Flow-DPO using a constant β:
A cowboy rides his horse across an open plain at sunset, with the camera capturing the warm colors of the sky and the soft light on the landscape.
A pair of animated sneakers with eyes and a mouth and a talking basketball with a face playing a game of one-on-one on an urban basketball court. The sneakers are dribbling and making quick moves, while the basketball is bouncing and trying to score. The court is surrounded by graffiti-covered walls and cheering spectators
A mechanical knight with steam-powered joints standing guard at an ancient castle gate. Gears whir softly as its head turns to scan the surroundings, while steam occasionally escapes from its armor joints.
A wandering alchemist with potion-filled vials clinking on their belt, gathering herbs in an enchanted forest where mushrooms glow and flowers whisper secrets
The camera follows a person standing alone by the lake, gazing at the distant sunset, with their reflection mirrored on the water’s surface.
The camera remains still, a woman with long brown hair and wearing a pink nightgown walks towards the bed in the bedroom and lays on it, the background is a cozy bedroom, warm evening light.