MiMo-V2-Flash is a high-efficiency Mixture-of-Experts (MoE) large language model developed by Xiaomi, designed specifically for high-speed inference and agentic workflows. The model features a total of 309 billion parameters, with only 15 billion active during any single forward pass. This sparse architecture allows it to maintain the performance of much larger models while significantly reducing computational overhead and latency.
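The "309B total / 15B active" split comes from sparse expert routing: a gating network sends each token to only a few experts, so most parameters sit idle on any given forward pass. Below is a minimal top-k routing sketch in NumPy; the dimensions, expert count, and k value are illustrative placeholders, not MiMo-V2-Flash's actual configuration.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sparse MoE layer: route each token to its top-k experts.

    Only the k selected experts run per token, so active parameters
    are a small fraction of the total. All shapes and k are toy
    values for illustration, not the real model's configuration.
    """
    logits = x @ gate_w                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # top-k expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over the selected experts' gate logits only
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()
        for weight, e in zip(w, topk[t]):
            out[t] += weight * experts[e](x[t])
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 16, 4
gate_w = rng.normal(size=(d, n_experts))
# each "expert" here is just a tiny linear map
mats = [rng.normal(size=(d, d)) / d for _ in range(n_experts)]
experts = [lambda v, M=M: M @ v for M in mats]
x = rng.normal(size=(tokens, d))
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (4, 8)
```

With k=2 of 16 experts active, only one-eighth of the expert parameters participate per token, which is the same mechanism (at a much larger scale) behind the model's compute savings.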
The model's architecture incorporates a Hybrid Attention mechanism that interleaves five layers of Sliding Window Attention (SWA) with one layer of Global Attention (GA). Within this 5:1 hybrid pattern, the SWA layers use an aggressive 128-token sliding window, cutting Key-Value (KV) cache storage to roughly one-sixth of what a fully global-attention stack would require, while a learnable attention sink bias helps maintain coherence over long sequences.
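The roughly six-fold KV-cache saving follows directly from the 5:1 layer pattern: SWA layers cache at most 128 tokens, while global layers cache the entire context. The sketch below does that bookkeeping; the layer count is an assumed placeholder, but the window, context length, and 5:1 ratio come from the figures above.

```python
def kv_cache_tokens(n_layers, context_len, window, pattern=(5, 1)):
    """Estimate KV-cache entries (per head) for a hybrid SWA/GA stack.

    pattern = (swa, ga): `swa` sliding-window layers for every `ga`
    global layer. SWA layers cache at most `window` tokens; global
    layers cache the full context. The layer count below is a toy
    assumption; window/context/ratio follow the stated figures.
    """
    swa_per_cycle, ga_per_cycle = pattern
    cycle = swa_per_cycle + ga_per_cycle
    total = 0
    for layer in range(n_layers):
        if layer % cycle < swa_per_cycle:
            total += min(window, context_len)  # sliding-window layer
        else:
            total += context_len               # global-attention layer
    return total

ctx, win, layers = 256_000, 128, 48  # layer count is illustrative
hybrid = kv_cache_tokens(layers, ctx, win)
dense = layers * ctx                 # all-global baseline
print(f"dense/hybrid cache ratio: {dense / hybrid:.2f}")  # ~5.99
```

At a 256K context the global layers dominate the cache, so the reduction factor converges to about (5 + 1) / 1 ≈ 6 regardless of the total layer count, matching the stated savings.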
To further enhance throughput, MiMo-V2-Flash employs Multi-Token Prediction (MTP), which lets the model draft and verify several tokens in parallel. This self-speculative decoding approach pushes generation speeds up to 150 tokens per second. The model is released under the MIT License and supports a context window of up to 256,000 tokens.
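The speedup from MTP comes from the verify step: the full model checks all drafted tokens in a single parallel pass, so each pass can accept several tokens at once. The toy loop below shows the accept-or-correct logic; the `draft` and `verify` functions are stand-ins for the model's MTP heads and main forward pass, not anything from the actual implementation.

```python
def speculative_decode(prompt, draft, verify, n_draft=3, max_len=12):
    """Toy self-speculative decoding loop in the spirit of MTP.

    `draft` cheaply proposes n_draft tokens; `verify` represents one
    parallel pass of the full model scoring every proposal at once.
    Accept the longest agreeing prefix, then take the full model's
    own token at the first mismatch. Both callables are hypothetical
    stand-ins, not the real model's heads.
    """
    seq = list(prompt)
    while len(seq) < max_len:
        proposals = draft(seq, n_draft)
        targets = verify(seq, proposals)   # one parallel forward pass
        for p, t in zip(proposals, targets):
            if p == t:
                seq.append(p)              # draft token accepted
            else:
                seq.append(t)              # full model's correction
                break
    return seq[:max_len]

# Toy "full model": next token is (prev + 1) % 10.
def verify(seq, proposals):
    out, prev = [], seq[-1]
    for p in proposals:
        out.append((prev + 1) % 10)
        prev = p                           # condition on the draft prefix
    return out

# Toy draft: usually right, but stumbles after token 4.
def draft(seq, k):
    out, prev = [], seq[-1]
    for _ in range(k):
        nxt = (prev + 1) % 10 if prev != 4 else 9
        out.append(nxt)
        prev = nxt
    return out

print(speculative_decode([0], draft, verify, n_draft=3, max_len=10))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

When the draft is right, three tokens land per verification pass; when it is wrong, the output is still exactly what greedy decoding of the full model would produce, which is why speculative decoding is lossless.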
In terms of capabilities, MiMo-V2-Flash is optimized for coding, mathematical reasoning, and multi-turn agent interactions. It features a "hybrid thinking" toggle that allows users to switch between instant replies for speed-sensitive tasks and an internal reasoning mode for complex problem-solving. It has demonstrated competitive performance on benchmarks such as SWE-Bench and AIME 2025.