Bridging Visual Dynamics and Reasoning Evaluation: Multimodal Large Language Models for Short Drama Quality Assessment
Apr 13, 2026
Qingyang Liu
Jiangtong Li#
Zelin Peng
Shaobo Wang
Zhaohe Liao
Shuochen Chang
Bingjie Gao
Haonan Zhao
Mu Liu
Jidong Jiang
Li Niu#
Abstract
Short drama quality assessment is crucial for industrial applications, including procurement decision support and addressing the cold-start problem in recommendation systems. However, existing video quality assessment approaches primarily focus on visual fidelity and often neglect higher-level narrative structure and reasoning logic. Likewise, other video understanding techniques tend to be event-centric, failing to adequately connect narrative elements with visual content. To bridge this gap between visual dynamics and narrative reasoning, we propose a user-centric quality indicator alongside an automated pipeline for constructing a Chain-of-Thought (CoT) dataset. To ensure data quality, this pipeline incorporates a hierarchical filtering mechanism that refines assessment accuracy, logical consistency, and the relevance of the reasoning, thereby steering the Multimodal Large Language Model (MLLM) toward human-aligned short drama assessment. We also develop the first MLLM for this task using a two-stage training framework: a Supervised Fine-Tuning (SFT) stage adapts the model to the assessment task, while a Group Relative Policy Optimization (GRPO) stage, using a customized reward function, further aligns its outputs with human preferences. Experimental results demonstrate that our model shows strong alignment with human preferences in short drama quality assessment and generates coherent explanations. Furthermore, online tests confirm our model boosts the cold-start performance of recommendation systems by improving multiple user engagement metrics.
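The GRPO stage mentioned in the abstract can be illustrated with a minimal sketch of the group-relative advantage computation: several responses are sampled for the same input, each is scored by a reward function, and each reward is normalized against the mean and standard deviation of its own group. The abstract does not specify the customized reward, so `toy_reward` below is a purely hypothetical stand-in (score closeness plus a small bonus for including reasoning), and the names `group_relative_advantages` and `eps` are illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, pstdev


def toy_reward(predicted_score, human_score, has_reasoning):
    """Hypothetical reward, NOT the paper's: closeness to the human score
    (both assumed to lie in [0, 1]) plus a small bonus when the response
    contains a reasoning chain."""
    accuracy = 1.0 - min(abs(predicted_score - human_score), 1.0)
    format_bonus = 0.1 if has_reasoning else 0.0
    return accuracy + format_bonus


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Responses that beat the group mean get a positive advantage, the rest
    a negative one. `eps` (an assumed constant) avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled assessments of one short drama; human score assumed 0.6.
    samples = [(0.6, True), (0.8, True), (0.2, False), (0.5, True)]
    rewards = [toy_reward(pred, 0.6, cot) for pred, cot in samples]
    for (pred, _), adv in zip(samples, group_relative_advantages(rewards)):
        print(f"predicted={pred:.1f} advantage={adv:+.3f}")
```

Because the advantages are computed relative to the group, no separate value network is needed; the policy update simply favors responses that score above their siblings for the same input.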
Type
Publication
Proceedings of the ACM Web Conference (WWW 2026)