Mistral Small 4 is a Mixture-of-Experts (MoE) language model that unifies several specialized model families into a single, versatile architecture. It consolidates the instruction-following capabilities of Mistral Small, the deep reasoning of Magistral, the multimodal understanding of Pixtral, and the agentic coding performance of Devstral. Released under the Apache 2.0 license, it is designed to eliminate the need for routing requests between specialized models by providing a single engine capable of handling varied workloads with high efficiency.
The model architecture features 119 billion total parameters, utilizing a sparse MoE setup with 128 experts. During inference, only 4 experts (approximately 6 billion parameters) are active per token, or roughly 8 billion parameters when including the embedding and output layers. This design allows the model to maintain the knowledge capacity of a much larger dense model while operating with significantly lower latency and higher throughput. It supports a 256k context window, enabling the processing of extensive documents and complex, multi-turn conversations.
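The sparse-activation idea above can be sketched in a few lines: a router scores all 128 experts for each token, keeps only the top 4, and renormalizes their weights. The expert count and top-k value come from the figures quoted here; the gating details (softmax scoring, renormalization) are a common MoE convention and an illustrative assumption, not Mistral's published implementation.

```python
import math

def route_token(logits, top_k=4):
    """Minimal top-k MoE gating sketch: softmax over all expert logits,
    keep the top_k experts, renormalize their weights to sum to 1.
    128 experts / top-4 mirror the Mistral Small 4 figures; the routing
    scheme itself is illustrative."""
    # Numerically stable softmax over all expert scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Select the top_k experts and renormalize their probabilities,
    # so only ~4/128 of the expert parameters run for this token.
    top = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# One token's router scores over 128 experts (dummy values).
weights = route_token([0.1 * i for i in range(128)], top_k=4)
```

Because only the selected experts' feed-forward weights are applied per token, compute scales with the ~6B active parameters rather than the 119B total.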
A key feature of Mistral Small 4 is the introduction of a configurable reasoning_effort parameter. This mechanism allows users to dynamically toggle the model's behavior between a "fast" mode for low-latency instruction following and a "reasoning" mode that utilizes test-time compute for deep, step-by-step problem solving. In reasoning mode, the model excels at complex mathematics, logic, and scientific reasoning, while the fast mode provides concise responses comparable to earlier Mistral Small iterations.
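In practice the toggle described above would surface as a request parameter. The sketch below shows how such a payload might look, assuming an OpenAI-compatible chat-completions shape; the `reasoning_effort` field name comes from this description, while the model identifier and surrounding payload structure are hypothetical, not Mistral's documented schema.

```python
def build_request(prompt, reasoning_effort="fast"):
    """Sketch of a chat request carrying the reasoning_effort toggle.
    Payload shape and model name are assumptions; only the parameter
    name is taken from the model description."""
    if reasoning_effort not in ("fast", "reasoning"):
        raise ValueError("reasoning_effort must be 'fast' or 'reasoning'")
    return {
        "model": "mistral-small-4",  # hypothetical identifier
        "messages": [{"role": "user", "content": prompt}],
        # "fast": low-latency instruction following;
        # "reasoning": spend test-time compute on step-by-step solving.
        "reasoning_effort": reasoning_effort,
    }

payload = build_request("Prove that the sum of two even numbers is even.",
                        reasoning_effort="reasoning")
```

The same conversation code path serves both modes; only this one field changes, which is what lets a single deployment replace separate fast and reasoning models.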
Mistral Small 4 is natively multimodal, accepting both text and image inputs for tasks such as document parsing, visual analysis, and chart interpretation. It is optimized for agentic workflows, demonstrating strong adherence to system prompts and native support for function calling and structured JSON output. Performance benchmarks indicate that the model matches or exceeds the accuracy of comparable 120B-class models while generating significantly shorter, more efficient outputs, which reduces end-to-end completion times and inference costs.
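The agentic loop described above typically works by giving the model a JSON-schema tool definition and executing whatever call it emits. The sketch below follows that widely used convention; the tool name, schema layout, and dispatch helper are all illustrative assumptions rather than Mistral's confirmed API surface.

```python
import json

# Hypothetical tool definition in the common JSON-schema style used
# for function calling; not taken from Mistral's documentation.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def dispatch(tool_call, registry):
    """Parse a model-emitted tool call (name + JSON argument string)
    and invoke the matching local function."""
    args = json.loads(tool_call["arguments"])
    return registry[tool_call["name"]](**args)

# Simulated model output: the structured call an agent runtime would
# receive back, then execute and feed the result into the next turn.
result = dispatch(
    {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    {"get_weather": lambda city: f"Sunny in {city}"},
)
```

Structured JSON output follows the same pattern: the schema constrains what the model emits, so the runtime can parse responses without brittle string matching.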