Alibaba
Open Weights

Qwen3 VL 30B A3B Instruct

Released Oct 2025

Intelligence: #263
Coding: #221
Math: #90
Context: 256K
Parameters: 30B (3B active)

Qwen3-VL-30B-A3B-Instruct is a multimodal vision-language model developed by Alibaba’s Qwen team. It uses a Mixture-of-Experts (MoE) architecture with 30 billion total parameters, of which approximately 3 billion are active. Architectural updates include Interleaved-MRoPE for spatial-temporal encoding and DeepStack for visual feature fusion, which enable high-resolution image perception and multilingual OCR across 32 languages. The model supports a 262,144-token (256K) context window for long-context video understanding and timestamp-based event localization. It is optimized for instruction following and can act as a visual agent, interpreting mobile and desktop graphical user interfaces (GUIs) for tasks such as visual coding and tool invocation. The model also offers spatial reasoning capabilities, including 2D and 3D object grounding within complex environments.
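As a quick sanity check on the headline numbers, the short Python sketch below relates the "256K" context label to the 262,144-token figure and estimates the share of parameters that are active. The constant and function names are illustrative, not part of any Qwen tooling, and the parameter counts are the rounded values quoted above, so the resulting fraction is approximate.

```python
# Rounded figures quoted in the description above (approximate, not exact counts).
TOTAL_PARAMS = 30e9        # ~30 billion total parameters
ACTIVE_PARAMS = 3e9        # ~3 billion active parameters
CONTEXT_TOKENS = 262_144   # context window in tokens


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that are active (hypothetical helper)."""
    return active / total


def context_in_k(tokens: int) -> int:
    """Express a token count in binary 'K' units (1K = 1,024 tokens)."""
    return tokens // 1024


if __name__ == "__main__":
    print(f"Context window: {context_in_k(CONTEXT_TOKENS)}K tokens")
    print(f"Active share: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
```

The "K" in the context label is the binary unit (1,024), which is why 256K corresponds to 262,144 tokens rather than 256,000.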

Rankings & Comparison