Step3-VL-10B is a high-performance, open-source multimodal foundation model developed by StepFun. Despite its compact 10-billion-parameter footprint, the model is designed to rival significantly larger systems by leveraging unified pre-training and specialized reasoning architectures. It integrates a 1.8B-parameter language-aligned perception encoder (PE-lang) with a Qwen3-8B decoder, enabling deep synergy between visual processing and text generation. The model was trained on a 1.2 trillion-token multimodal corpus, focusing on reasoning-heavy tasks such as competitive mathematics, document parsing, and spatial understanding.
A defining feature of Step3-VL-10B is its dual inference paradigm, which allows users to balance speed and accuracy. The Sequential Reasoning (SeRe) mode is optimized for general tasks and efficient generation, while the Parallel Coordinated Reasoning (PaCoRe) mode scales test-time compute by exploring multiple visual hypotheses in parallel. In PaCoRe mode, the model has demonstrated frontier-level results on benchmarks such as AIME 2025 and MMMU, often outperforming models 10 to 20 times its size.
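The parallel-hypothesis idea behind PaCoRe can be sketched with a generic self-consistency voting loop: sample several independent reasoning paths and return the majority answer. This is a minimal illustration of the general test-time-scaling technique, not Step3-VL-10B's actual implementation; `generate_answer` is a hypothetical stand-in for a real model call.

```python
from collections import Counter

def generate_answer(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a stochastic model call: most sampled
    # paths agree on the correct answer, a minority drift to a wrong one.
    # A real system would decode with temperature > 0 per path.
    return "42" if seed % 4 else "41"

def parallel_reason(prompt: str, n_paths: int = 8) -> str:
    """Run n_paths independent reasoning paths, return the majority answer."""
    answers = [generate_answer(prompt, seed) for seed in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]

print(parallel_reason("What is 6 * 7?"))  # majority of the 8 paths -> "42"
```

Spending compute on more paths trades latency for accuracy, which is the balance the two modes expose to users.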
Architecture and Capabilities
The model's vision system utilizes a multi-crop strategy, combining a 728×728 global view with multiple local crops to preserve high-resolution detail during processing. This makes it particularly effective for high-precision OCR and complex GUI interactions. Step3-VL-10B supports a context window of up to 128,000 tokens in its advanced reasoning mode, allowing for the analysis of long documents and multiple high-density images within a single session.
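The tiling arithmetic behind such a multi-crop scheme can be sketched as follows: alongside a downscaled 728×728 global view, the original image is covered by a grid of local crop boxes so fine detail survives. The tile size reuses the 728 figure from the text, but the grid logic and the cap on tiles are illustrative assumptions, not Step3-VL-10B's actual preprocessing code.

```python
import math

GLOBAL_SIZE = 728  # global-view resolution described in the text

def crop_boxes(width: int, height: int,
               tile: int = GLOBAL_SIZE, max_tiles: int = 9):
    """Return (left, top, right, bottom) boxes covering the image in a grid,
    capped at max_tiles. Edge boxes are clamped to the image bounds."""
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if len(boxes) >= max_tiles:
                return boxes
            boxes.append((c * tile, r * tile,
                          min((c + 1) * tile, width),
                          min((r + 1) * tile, height)))
    return boxes

# A 2000x1400 document scan yields a 3x2 grid of local crops
# alongside the downscaled global view.
print(crop_boxes(2000, 1400))  # 6 boxes, starting at (0, 0, 728, 728)
```

Each box would then be cropped from the original image and encoded at full resolution, which is what keeps small text legible for OCR-style tasks.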
Step3-VL-10B is released under the Apache 2.0 license, facilitating both research and commercial applications. Its efficient design allows it to run on consumer-grade hardware, making enterprise-grade multimodal reasoning accessible for on-device deployment and local specialized workflows.