Step-3 is a large-scale multimodal Mixture-of-Experts (MoE) model developed by StepFun, featuring 321 billion total parameters, of which 38 billion are activated per token. It is designed to balance frontier-level reasoning performance with high decoding efficiency, particularly for long-context and multimodal tasks. The model was pretrained on a dataset comprising 20 trillion text tokens and 4 trillion image-text pairs.
The architecture introduces two key technical innovations: Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD). MFA is a specialized attention mechanism that reduces KV cache demands and computational overhead, while AFD is a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. These optimizations are intended to maximize hardware utilization and increase decoding throughput on flagship accelerators.
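To make the KV-cache benefit concrete, the sketch below compares per-layer cache sizes for standard multi-head attention against an attention variant that factorizes the projections and shares a single key/value head across all query heads. This is an illustrative back-of-the-envelope calculation, not the official Step-3 implementation: the dimensions (`d_head`, `n_heads`, `seq_len`) and the single-shared-KV-head assumption are chosen for illustration and are not taken from the Step-3 technical report.

```python
# Hedged illustration: why factorized/shared-KV attention shrinks the KV cache.
# All sizes below are assumptions for the sake of arithmetic, not Step-3 specs.

n_heads = 32        # query heads (assumed)
d_head = 128        # per-head dimension (assumed)
seq_len = 8192      # cached context length (assumed)
bytes_per = 2       # bf16/fp16 element size

# Standard multi-head attention: cache K and V for every head, every token.
mha_cache = seq_len * n_heads * d_head * 2 * bytes_per

# Shared-KV factorized attention: only one K/V pair cached per token,
# regardless of how many query heads attend to it.
shared_kv_cache = seq_len * 1 * d_head * 2 * bytes_per

print(f"Standard MHA KV cache: {mha_cache / 2**20:.1f} MiB per layer")
print(f"Shared-KV cache:       {shared_kv_cache / 2**20:.1f} MiB per layer")
print(f"Reduction factor:      {mha_cache // shared_kv_cache}x")
```

Under these assumed dimensions the shared-KV layout cuts the per-layer cache by the number of query heads (32x here), which is the kind of saving that lets decoding stay memory-bandwidth-efficient at long context lengths.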
In terms of capabilities, Step-3 excels at visual perception and complex reasoning, demonstrating strong performance on benchmarks such as MMMU, MathVision, and AIME 2025. It is capable of cross-domain knowledge understanding, mathematical analysis, and detailed visual interpretation, making it suitable for applications requiring both high intelligence and cost-effective deployment.