MiMo-V2-Flash is a large-scale Mixture-of-Experts (MoE) language model developed by Xiaomi, released in late 2025. It has 309 billion total parameters, of which approximately 15 billion are active per forward pass. The model is designed for high-throughput inference and strong performance in reasoning, coding, and agentic workflows.
The architecture incorporates several technical innovations to optimize speed and memory usage. It uses a Hybrid Attention mechanism that interleaves Sliding Window Attention (SWA) with Global Attention (GA) in a 5:1 ratio. This configuration, paired with an aggressive 128-token sliding window, reportedly cuts Key-Value (KV) cache memory requirements to roughly one-sixth of those of a standard dense-attention model.
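The roughly sixfold saving follows directly from the 5:1 interleave: at long context lengths, five of every six layers cache only the last 128 tokens, while one in six caches everything. A minimal back-of-the-envelope sketch, assuming a hypothetical 48-layer depth (the source gives the ratio and window size but not the layer count):

```python
def kv_cache_tokens(seq_len, n_layers=48, swa_per_global=5, window=128):
    """Total tokens held in the KV cache across all layers.

    Layer count (48) is a hypothetical value for illustration; the
    5:1 SWA:GA interleave and 128-token window are from the model spec.
    """
    group = swa_per_global + 1
    swa_layers = n_layers * swa_per_global // group   # 40 sliding-window layers
    global_layers = n_layers - swa_layers             # 8 global layers
    # SWA layers cache at most `window` tokens; global layers cache all tokens
    return swa_layers * min(seq_len, window) + global_layers * seq_len

seq_len = 256 * 1024                      # 256K context
dense = 48 * seq_len                      # every layer caches everything
hybrid = kv_cache_tokens(seq_len)
ratio = dense / hybrid                    # approaches 6x as seq_len grows
```

At 256K context the ratio works out to just under 6x, consistent with the "nearly six times" figure; in the limit of very long sequences the window term vanishes and the ratio tends to exactly group size / global fraction = 6.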
To further increase generation speed, MiMo-V2-Flash integrates a native Multi-Token Prediction (MTP) module. Unlike traditional speculative decoding, which requires a separate draft model, MiMo-V2-Flash embeds lightweight dense feed-forward networks (FFNs) into the architecture to predict multiple future tokens at once. This self-speculative capability yields significant speedups, with reported inference rates of up to 150 tokens per second on optimized hardware.
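The accept/verify loop behind self-speculative decoding can be sketched in a few lines. This is a simplified toy, not the model's actual implementation: `draft` stands in for the lightweight MTP head and `verify` for a single (in reality parallel) pass of the full model, and both are modeled here as plain functions from a token list to the next token id.

```python
def speculative_step(verify, draft, prefix, k=3):
    """One self-speculative decoding step (toy sketch).

    `draft` proposes k tokens cheaply (MTP-style); `verify` is the full
    model. The longest draft prefix the full model agrees with is
    accepted, plus one token from the full model itself, so each step
    emits between 1 and k+1 tokens.
    """
    # 1) draft k future tokens autoregressively with the cheap head
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) verify: accept drafted tokens until the full model disagrees
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = verify(ctx)
        if t != expected:
            accepted.append(expected)   # full model's correction token
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(verify(ctx))    # bonus token when all drafts pass
    return accepted
```

When the draft head matches the full model, a step emits k+1 tokens for roughly the cost of one full-model pass; when it diverges immediately, the step degrades gracefully to ordinary one-token decoding, which is why speedup depends on the draft head's acceptance rate.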
The model excels in software engineering and mathematical reasoning, performing competitively on benchmarks such as SWE-bench Verified and AIME 2025. It supports a 256K-token context window and includes a reasoning toggle that lets users enable or disable internal "thinking" during generation. The model weights are released under the MIT license.