Jiangtong Li

Causal-VidQA

Mar 1, 2022 · 1 min read

Causal-VidQA is a benchmark for causal and temporal reasoning in video question answering, featuring evidence and commonsense reasoning tasks. It was published at CVPR 2022.

Last updated on May 21, 2026
Video Understanding · Causal Inference
Authors
Jiangtong Li (李江彤)
Postdoctoral Associate


© 2026 Jiangtong Li
