Bridging Visual Dynamics and Reasoning Evaluation: Multimodal Large Language Models for Short Drama Quality Assessment

Apr 13, 2026

Qingyang Liu

Jiangtong Li#

Zelin Peng

Shaobo Wang

Zhaohe Liao

Shuochen Chang

Bingjie Gao

Haonan Zhao

Mu Liu

Jidong Jiang

Li Niu#

Abstract

Short drama quality assessment is crucial for industrial applications, including procurement decision support and addressing the cold-start problem in recommendation systems. However, existing video quality assessment approaches primarily focus on visual fidelity and often neglect higher-level narrative structure and reasoning logic. Likewise, other video understanding techniques tend to be event-centric, failing to adequately connect narrative elements with visual content. To bridge this gap between visual dynamics and narrative reasoning, we propose a user-centric quality indicator alongside an automated pipeline for constructing a Chain-of-Thought (CoT) dataset. To ensure data quality, this pipeline incorporates a hierarchical filtering mechanism that refines assessment accuracy, logical consistency, and the relevance of the reasoning, thereby steering the Multimodal Large Language Model (MLLM) toward human-aligned short drama assessment. We also develop the first MLLM for this task using a two-stage training framework: a Supervised Fine-Tuning (SFT) stage adapts the model to the assessment task, while a Group Relative Policy Optimization (GRPO) stage, using a customized reward function, further aligns its outputs with human preferences. Experimental results demonstrate that our model shows strong alignment with human preferences in short drama quality assessment and generates coherent explanations. Furthermore, online tests confirm our model boosts the cold-start performance of recommendation systems by improving multiple user engagement metrics.
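
As a rough illustration of the GRPO stage mentioned in the abstract, the sketch below shows the group-relative advantage computation that GRPO is generally known for: several responses are sampled for the same input, each is scored, and every score is normalized against the mean and standard deviation of its own group. This is not the authors' implementation. The function name, the epsilon value, the choice of population standard deviation, and the example rewards are all assumptions made for illustration, and the paper's customized reward function is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar score per response
    sampled for the same input. Population standard deviation is used
    here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one short drama.
advantages = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

Responses scored above the group mean receive a positive advantage and those below receive a negative one, so the policy update favors the better responses within each group without requiring a separate value model.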

Type

Conference paper

Publication

Proceedings of the ACM Web Conference (WWW 2026)

Last updated on May 21, 2026

Multimodal LLM, Video Quality Assessment, Short Drama
