Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners
May 1, 2026·,,,,,,,,,,,·
0 min read
Qingyang Liu
Bingjie Gao
Canmiao Fu
Zhipeng Huang
Chen Li
Feng Wang
Shuochen Chang
Shaobo Wang
Yali Wang
Keming Ye
Jiangtong Li
Li Niu

Abstract
Recent unified models integrate multimodal understanding and generation within a single framework. However, an understanding-generation gap persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image task (X2I): the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. We construct a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity among simple-to-complex instructions.
Type
Publication
International Conference on Machine Learning (ICML 2026)