Jiangtong Li
  • Projects
    • Multimodal Finance
    • Image Harmonization
    • Causal-VidQA
  • Publications
    • Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners
    • Can LLMs Really Judge? A Progressive Argumentation-Mining Framework for Distinguishing Understanding from Aggregation
    • Cross-Paradigm Graph Backdoor Attacks with Promptable Subgraph Triggers
    • Bridging Visual Dynamics and Reasoning Evaluation: Multimodal Large Language Models for Short Drama Quality Assessment
    • DynaMind: Reconstructing Dynamic Visual Scenes from EEG by Aligning Temporal Dynamics and Multimodal Semantics to Guided Diffusion
    • Parse, Align and Aggregate: Graph-driven Compositional Reasoning for Video Question Answering
    • Attack by Yourself: Effective and Unnoticeable Multi-Category Graph Backdoor Attacks with Subgraph Triggers Pool
    • Divide and Conquer: Exploring Language-centric Tree Reasoning for Video Question-Answering
    • InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating
    • HFTCRNet: Hierarchical Fusion Transformer for Interbank Credit Rating and Risk Assessment
    • RA-CFGPT: Chinese Financial Assistant with Retrieval-Augmented Large Language Model
    • Multi-Patch Prediction: Adapting LLMs for Time Series Representation Learning
    • Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering
    • Knowledge Proxy Intervention for Deconfounded Video Question Answering
    • From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering
    • Zero-Shot Sketch-Based Image Retrieval with Structure-aware Asymmetric Disentanglement
    • Action-Aware Embedding Enhancement for Image-Text Retrieval
    • Memorize, Associate and Match: Embedding Enhancement via Fine-Grained Alignment for Image-Text Retrieval
    • Video Semantic Segmentation via Sparse Temporal Transformer
    • Modeling Multi-turn Conversation with Deep Utterance Aggregation

Visual Reasoning

Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

Qingyang Liu, Bingjie Gao, Canmiao Fu, Zhipeng Huang, Chen Li, Feng Wang, Shuochen Chang, Shaobo Wang, Yali Wang, Keming Ye, Jiangtong Li, Li Niu

International Conference on Machine Learning (ICML 2026)

Multimodal LLM, Visual Reasoning, Image Generation

© 2026 Jiangtong Li
