MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

ICLR 2026

Jinhua Zhang, Wei Long, Minghao Han, Weiyi You, Shuhang Gu*
University of Electronic Science and Technology of China  
*Corresponding Author

[arXiv]  [Code]  [HuggingFace] 

TL;DR:

🔥 3.0× Memory Efficiency & KV-Cache Free Inference. We propose Markovian Visual AutoRegressive modeling (MVAR), which introduces scale and spatial Markov assumptions to mitigate redundancy in visual generation. By reducing complexity from O(N^2) to O(N×k), MVAR enables parallel training on just eight RTX 4090s and achieves state-of-the-art performance on ImageNet.


Method


Conventional visual autoregressive methods exhibit scale and spatial redundancy by conditioning each scale on all previous scales and requiring each token to attend to all preceding ones. We propose MVAR, a novel framework that introduces:

(a) Scale-Markovian Trajectory: Unlike previous methods that condition on all prior scales [1], MVAR assumes that the current scale depends only on its immediate predecessor. This decoupling lets us train all scales in parallel rather than sequentially, substantially lowering GPU memory consumption (see the training sketch after this list).

(b) Spatial-Markovian Attention: We restrict the receptive field of each visual token to a localized k×k neighborhood [2] at the corresponding position of the adjacent scale. This sparse attention avoids the quadratic cost of global self-attention while still capturing the necessary spatial priors (see the attention sketch below).

(c) KV-Cache Free Inference: Because of the Markovian structure of our spatial attention, the model never needs to store long-range token dependencies. This enables highly memory-efficient sampling without the overhead of key-value caches (see the sampling sketch below).
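To make the scale-Markovian trajectory in (a) concrete, here is a minimal PyTorch sketch of a training step under that assumption. The function names, the `model(cond, target_shape=...)` interface, and the nearest-neighbor conditioning are our own illustrative choices, not the released MVAR code.

```python
import torch
import torch.nn.functional as F

def scale_markov_training_step(model, token_maps, optimizer):
    """token_maps: list of integer token maps [B, h_s, w_s], coarsest to finest."""
    losses = []
    for prev, cur in zip(token_maps[:-1], token_maps[1:]):
        # Condition only on the immediately preceding scale; a real implementation
        # would embed `prev` and upsample the embeddings to the current resolution.
        cond = F.interpolate(prev.float().unsqueeze(1), size=cur.shape[-2:], mode="nearest")
        logits = model(cond, target_shape=cur.shape[-2:])      # [B, h*w, vocab] (assumed API)
        losses.append(F.cross_entropy(logits.flatten(0, 1), cur.flatten()))
    # Each (prev, cur) pair is independent under the scale-Markov assumption, so this
    # loop can be flattened into a single batch and run fully in parallel; it is
    # written sequentially here only for clarity.
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```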
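The spatial-Markovian attention in (b) can be illustrated with the simplified single-head sketch below: each query token at the current scale attends only to a k×k window of the (already upsampled) previous-scale features at the corresponding position. Projection layers, multiple heads, and border masking are omitted; this is a simplification under our own assumptions, not the paper's exact operator.

```python
import torch
import torch.nn.functional as F

def neighborhood_cross_attention(q, kv, k=3):
    """q, kv: [B, C, H, W] feature maps at the same resolution (kv upsampled from the previous scale)."""
    B, C, H, W = q.shape
    pad = k // 2
    # Gather the k*k neighborhood of kv around every spatial location: [B, C*k*k, H*W].
    neigh = F.unfold(kv, kernel_size=k, padding=pad).view(B, C, k * k, H * W)
    q = q.view(B, C, 1, H * W)
    # Dot-product attention restricted to the local window (cost O(N * k^2), not O(N^2)).
    attn = (q * neigh).sum(dim=1) / C ** 0.5                  # [B, k*k, HW]
    attn = attn.softmax(dim=1)
    out = (neigh * attn.unsqueeze(1)).sum(dim=2)              # [B, C, HW]
    return out.view(B, C, H, W)
```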
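Finally, the KV-cache-free sampling in (c) follows directly from the two Markov assumptions: at every step only the previous scale's token map is kept as conditioning, so no key-value cache over the full token history is ever allocated. The sampling loop below is illustrative only, with a hypothetical `model` interface.

```python
import torch

@torch.no_grad()
def markov_sample(model, scale_sizes, start_tokens):
    prev = start_tokens                                  # coarsest-scale token map, e.g. [B, 1, 1]
    for (h, w) in scale_sizes[1:]:
        logits = model(prev, target_shape=(h, w))        # conditioned on the previous scale only
        probs = logits.softmax(dim=-1)                   # [B, h*w, vocab]
        cur = torch.multinomial(probs.flatten(0, 1), 1).view(-1, h, w)
        prev = cur                                       # earlier scales are discarded, no KV cache
    return prev                                          # finest-scale token map for VQ decoding
```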


Experimental Results




References


[1] Tian K, Jiang Y, Yuan Z, et al. Visual autoregressive modeling: Scalable image generation via next-scale prediction[J]. Advances in neural information processing systems, 2024, 37: 84839-84865.
[2] Hassani A, Walton S, Li J, et al. Neighborhood attention transformer[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 6185-6194.