StepFun
Open Weights

Step3 VL 10B

Released Jan 2026

Intelligence: #277
Coding: #231
Context: 66K
Parameters: 10B

Step3-VL-10B is a 10-billion-parameter multimodal foundation model developed by StepFun (Jieyue Xingchen). Released in January 2026, the model is designed to deliver high-performance visual and linguistic reasoning within a compact footprint, competing with models ten to twenty times its size. Its architecture pairs a language-aligned Perception Encoder with a Qwen3-8B decoder, and the model was pre-trained on a 1.2-trillion-token multimodal corpus.

The model's reasoning capabilities are driven by two distinct paradigms: Sequential Reasoning (SeRe) and Parallel Coordinated Reasoning (PaCoRe). SeRe follows standard chain-of-thought generation, while PaCoRe scales test-time computation by exploring multiple perceptual hypotheses in parallel and synthesizing them into a final conclusion. This dual approach enables the model to handle challenging tasks such as high-precision Optical Character Recognition (OCR), document parsing, spatial relationship reasoning, and grounding for graphical user interfaces (GUI).
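
The difference between the two paradigms can be illustrated with a minimal Python sketch. It is not StepFun's implementation: the `generate` callable, the prompt wording, and the `num_hypotheses` parameter are all hypothetical stand-ins, and the sketch only shows the general pattern of one sequential pass versus several parallel passes followed by a synthesis pass.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical type: any function that maps a text prompt to a text response
# (for example, a thin wrapper around a model endpoint). Not a StepFun API.
Generate = Callable[[str], str]


def sequential_reasoning(generate: Generate, question: str) -> str:
    """SeRe-style pattern: a single chain-of-thought pass."""
    return generate(f"Think step by step, then answer.\nQuestion: {question}")


def parallel_coordinated_reasoning(
    generate: Generate, question: str, num_hypotheses: int = 4
) -> str:
    """PaCoRe-style pattern (assumed): parallel drafts, then one synthesis pass."""
    # Each draft prompt is tagged so the attempts are distinguishable.
    prompts: List[str] = [
        f"[attempt {i + 1}] Think step by step, then answer.\nQuestion: {question}"
        for i in range(num_hypotheses)
    ]

    # Run the independent drafts concurrently; pool.map preserves prompt order.
    with ThreadPoolExecutor(max_workers=num_hypotheses) as pool:
        hypotheses = list(pool.map(generate, prompts))

    # Feed every draft back to the model and ask for one final conclusion.
    numbered = "\n".join(
        f"Hypothesis {i + 1}: {text}" for i, text in enumerate(hypotheses)
    )
    synthesis_prompt = (
        f"Question: {question}\n{numbered}\n"
        "Compare the candidates above and give one final answer."
    )
    return generate(synthesis_prompt)
```

In this sketch the parallel variant costs `num_hypotheses + 1` model calls per question instead of one, which is the sense in which such a scheme trades extra test-time computation for a synthesized answer.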

Step3-VL-10B has achieved state-of-the-art results in its parameter class across several benchmarks, including MMMU, ChartQA, and MathVision. It is particularly noted for its proficiency in competitive mathematics, demonstrating strong performance on the AIME 2025 benchmark. The model supports a context window of up to 128,000 tokens, allowing for the processing of high-resolution images and complex, multi-image sequences. Its weights are released under the Apache 2.0 license to support open-source research and on-device AI applications.

Rankings & Comparison