Qwen3-VL-235B-A22B-Instruct is a multimodal large language model (MLLM) developed by Alibaba Cloud's Qwen team. It uses a Mixture-of-Experts (MoE) architecture with 235 billion total parameters, of which approximately 22 billion are activated per token during inference. Designed to process interleaved text, image, and video inputs, the model is the flagship vision-language entry in the Qwen3 series, succeeding the Qwen2.5-VL lineage with significant improvements in visual reasoning and document understanding.
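The practical effect of this sparsity can be illustrated with a toy top-k-routed MoE layer: a router scores the experts for each token and only the k highest-scoring experts run, so the parameters exercised per token are a small fraction of the layer's total. The PyTorch sketch below is purely illustrative; the dimensions, expert count, and routing scheme are assumptions and do not reflect Qwen3-VL's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy MoE layer illustrating sparse activation.

    Only the top-k experts chosen by the router run for each token,
    so the "active" parameter count per token is a small fraction of
    the layer's total. Sizes are illustrative, not Qwen3-VL's.
    """

    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)       # (tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)   # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TopKMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64]); each token used only 2 of 8 experts
```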
Capabilities and Architecture
The model incorporates architectural advances such as Interleaved-MRoPE, a positional-embedding scheme that allocates rotary frequencies across the temporal, width, and height dimensions to enhance long-horizon video reasoning. It also features DeepStack, which fuses visual features from multiple levels of the vision encoder to capture fine-grained visual details. The model supports a native context window of 256K tokens, expandable to approximately one million, enabling it to analyze hours-long video sequences and extensive document sets with high recall.
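One way to picture the interleaving is as a round-robin assignment of rotary frequency bands across the three axes, so that time, height, and width each receive bands spanning the full frequency spectrum rather than one contiguous chunk each. The NumPy sketch below is a conceptual approximation under that assumption, not the model's actual implementation.

```python
import numpy as np

def interleaved_mrope_freqs(head_dim: int, base: float = 10000.0):
    """Conceptual sketch of interleaved multi-axis RoPE frequencies.

    A chunked M-RoPE hands each axis a contiguous slice of frequency
    bands; the interleaved variant round-robins the bands across
    (time, height, width) so every axis sees low and high frequencies.
    The assignment rule here is an assumption for illustration.
    """
    n_bands = head_dim // 2
    inv_freq = 1.0 / (base ** (np.arange(n_bands) / n_bands))
    axes = ["t", "h", "w"]
    # Round-robin assignment: band i goes to axis i % 3.
    return {ax: inv_freq[i::3] for i, ax in enumerate(axes)}

def rotary_angles(pos_t, pos_h, pos_w, assignment):
    """Angle per band is the position along its axis times its frequency."""
    return np.concatenate([
        pos_t * assignment["t"],
        pos_h * assignment["h"],
        pos_w * assignment["w"],
    ])

# Example: a visual token at frame 12, row 3, column 7 of the patch grid.
angles = rotary_angles(12, 3, 7, interleaved_mrope_freqs(head_dim=128))
print(angles.shape)  # (64,) rotation angles for a 128-dim attention head
```

By contrast, a chunked assignment would give each axis one contiguous slice of the frequency table, leaving some axis with only low (or only high) frequencies.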
Specialized Multimodal Features
Qwen3-VL-235B-A22B-Instruct is optimized for agentic tasks, including operating PC and mobile graphical user interfaces (GUIs) by recognizing on-screen elements and invoking tools. Its visual coding capabilities allow it to generate code (such as HTML, CSS, or JavaScript) directly from design mockups or videos. Furthermore, it supports enhanced Optical Character Recognition (OCR) in 32 languages and provides spatial perception for 2D and 3D grounding, making it applicable to embodied AI and complex scene analysis.
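As a concrete example of the visual-coding use case, the following sketch sends a design mockup to the model through Hugging Face transformers and asks for matching HTML/CSS. It assumes a recent transformers release with Qwen3-VL support; the mockup filename is a placeholder, and the exact message format may vary across versions.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-235B-A22B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# "mockup.png" is a placeholder path for a UI design screenshot.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "mockup.png"},
            {"type": "text", "text": "Write the HTML and CSS that reproduce this design."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, not the prompt.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Serving a 235-billion-parameter checkpoint requires a multi-GPU node; `device_map="auto"` shards the weights across whatever devices are available.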