Parse, Align and Aggregate: Graph-driven Compositional Reasoning for Video Question Answering

Jan 1, 2026 · Jiangtong Li, Zhaohe Liao, Fengshun Xiao, Tianjiao Li, Qiang Zhang, Haohua Zhao, Li Niu, Guang Chen, Liqing Zhang, Changjun Jiang
Abstract
Video Question-Answering (VideoQA) enables machines to interpret and respond to complex video content, advancing human-computer interaction. However, existing multimodal large language models (MLLMs) often provide incomplete or opaque explanations, and existing benchmarks mainly focus on the correctness of final answers, which limits insight into the models' reasoning processes and hinders both transparency and verifiability. To address this gap, we propose the Question Parsing, Video Alignment and Answer Aggregation framework (QPVA3), which leverages a compositional graph to drive visual and logical reasoning in VideoQA. Specifically, QPVA3 consists of three core components, the planner, the executor, and the reasoner, which together generate the compositional graph and conduct graph-driven reasoning. The planner parses the original question into the compositional graph, capturing the underlying reasoning logic and structuring it into a series of interconnected questions. For each question in the compositional graph, the executor aligns the video with the question by selecting relevant video clips and generates an answer, ensuring accurate, context-specific responses. For each question together with its first-order descendants, the reasoner aggregates answers by integrating reasoning logic with visual evidence, resolving conflicts to produce a coherent and accurate response. Moreover, to assess the performance of existing MLLMs in the reasoning processes of VideoQA, we introduce novel compositional consistency metrics and construct a VideoQA benchmark (QPVA3Bench) with 3,492 question-video tuples, each annotated with detailed compositional graphs and fine-grained answers.
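As a reading aid, the planner, executor, and reasoner flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `QuestionNode` class, the `answer_graph` function, the hand-written graph (standing in for planner output), the canned-answer executor, and the rule-based reasoner are all hypothetical placeholders chosen only to show how answers might propagate from sub-questions to the original question.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QuestionNode:
    """One question in a hypothetical compositional graph."""
    qid: str
    text: str
    children: List[str] = field(default_factory=list)  # first-order descendants


def topological_order(graph: Dict[str, QuestionNode], root: str) -> List[str]:
    """Post-order DFS: every child id appears before its parent id."""
    order: List[str] = []
    seen = set()

    def visit(qid: str) -> None:
        if qid in seen:
            return
        seen.add(qid)
        for child in graph[qid].children:
            visit(child)
        order.append(qid)

    visit(root)
    return order


def answer_graph(
    graph: Dict[str, QuestionNode],
    root: str,
    executor: Callable[[str], str],
    reasoner: Callable[[str, str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Answer each question directly, then reconcile it with its children."""
    answers: Dict[str, str] = {}
    for qid in topological_order(graph, root):
        node = graph[qid]
        direct = executor(node.text)  # stand-in for clip selection + answering
        child_answers = {c: answers[c] for c in node.children}
        answers[qid] = reasoner(node.text, direct, child_answers)
    return answers


# Hand-written graph standing in for what a planner might produce (assumption).
graph = {
    "root": QuestionNode("root", "Is the person cooking in the kitchen?", ["q1", "q2"]),
    "q1": QuestionNode("q1", "Is there a person?"),
    "q2": QuestionNode("q2", "Is the scene a kitchen?"),
}

# Canned answers standing in for a video-grounded model (assumption).
canned = {
    "Is the person cooking in the kitchen?": "yes",
    "Is there a person?": "yes",
    "Is the scene a kitchen?": "no",
}


def toy_executor(question: str) -> str:
    return canned[question]


def toy_reasoner(question: str, direct: str, child_answers: Dict[str, str]) -> str:
    # Toy conflict rule: a "yes" is overturned if any sub-question says "no".
    if direct == "yes" and "no" in child_answers.values():
        return "no"
    return direct


answers = answer_graph(graph, "root", toy_executor, toy_reasoner)
print(answers["root"])
```

In this toy run the direct answer to the root question conflicts with one of its sub-answers, so the placeholder reasoner revises it. The actual framework resolves such conflicts using reasoning logic and visual evidence, which this sketch does not attempt to model.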
Type
Publication
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI 2026)