Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

May 1, 2026

Qingyang Liu

Bingjie Gao

Canmiao Fu

Zhipeng Huang

Chen Li

Feng Wang

Shuochen Chang

Shaobo Wang

Yali Wang

Keming Ye

Jiangtong Li

Li Niu

Abstract

Recent unified models integrate multimodal understanding and generation within a single framework. However, an understanding-generation gap persists: models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in the anything-to-image (X2I) task: the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that enables unified models to switch autonomously between generation strategies based on instruction complexity and model capability. We build a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset of over 50,000 samples and implement a two-stage training strategy comprising supervised fine-tuning (SFT) and reinforcement learning (RL). Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity across instructions ranging from simple to complex.

Type

Conference paper

Publication

International Conference on Machine Learning (ICML 2026)

Last updated on May 21, 2026

Multimodal LLM, Visual Reasoning, Image Generation
