MiMo-V2-Flash is an open-source Mixture-of-Experts (MoE) foundation language model developed by Xiaomi, engineered for high-throughput reasoning, coding, and agentic tasks. The architecture comprises 309 billion total parameters, of which 15 billion are active during any single forward pass. The design goal is competitive capability combined with high inference efficiency and low operational cost.
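The gap between total and active parameters comes from sparse expert routing: a gating network selects a small subset of experts per token, so only that subset's weights participate in each forward pass. The sketch below is a minimal, illustrative top-k MoE layer in NumPy; the dimensions, expert count, and top-k value are placeholders, not MiMo-V2-Flash's actual configuration.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Illustrative top-k MoE routing: only k of the experts run per token,
    so active parameters are a fraction of total parameters."""
    logits = x @ gate_w                           # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts
    sel = np.take_along_axis(logits, topk, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # softmax over selected experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                   # combine the chosen experts' outputs
        for j in range(k):
            out[t] += w[t, j] * experts[topk[t, j]](x[t])
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4                               # toy sizes, not the real model's
experts = [lambda v, W=rng.standard_normal((d, d)) * 0.1: v @ W
           for _ in range(n_experts)]
x = rng.standard_normal((3, d))
gate_w = rng.standard_normal((d, n_experts))
y = moe_forward(x, gate_w, experts)
print(y.shape)  # (3, 8)
```

With k=2 of 4 experts active, roughly half the expert weights are touched per token; scaling the same idea to hundreds of experts yields the large total-to-active ratio seen here (309B total vs. 15B active).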
The model utilizes a hybrid attention architecture that alternates between Sliding Window Attention (SWA) and Global Attention (GA) in a 5:1 ratio. This design, paired with a 128-token window, reduces KV-cache memory requirements by nearly six times compared to standard global attention. Additionally, the model integrates a Multi-Token Prediction (MTP) module that enables parallel decoding, raising token generation throughput during inference.
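The "nearly six times" figure follows directly from the 5:1 layer ratio: SWA layers cache at most 128 positions regardless of sequence length, while only the global layers cache the full sequence. A back-of-the-envelope calculation, with a placeholder layer count (the real depth is not stated here):

```python
def kv_cache_positions(seq_len, n_layers, swa_ratio=5, window=128):
    """Cached KV positions summed over layers for the hybrid design vs.
    all-global attention. Pure arithmetic sketch; n_layers is a placeholder."""
    group = swa_ratio + 1
    n_swa = n_layers * swa_ratio // group      # 5 of every 6 layers use SWA
    n_global = n_layers - n_swa                # the remaining layers are global
    hybrid = n_swa * min(seq_len, window) + n_global * seq_len
    full = n_layers * seq_len                  # every layer caches everything
    return hybrid, full

hybrid, full = kv_cache_positions(seq_len=262_144, n_layers=48)
print(f"{full / hybrid:.2f}x")  # 5.99x
```

At long contexts the SWA layers' contribution becomes negligible, so the saving approaches the 6:1 layer ratio, matching the "nearly six times" claim.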
A distinguishing feature of the model is its reasoning capability, which can be controlled via a "thinking" toggle. This functionality allows the model to engage in extended internal chain-of-thought processing, a behavior refined through Multi-Teacher On-Policy Distillation (MOPD) and large-scale reinforcement learning. MiMo-V2-Flash supports a context window of 262,144 tokens (256K), facilitating long-document analysis and complex multi-step agent interactions.
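Thinking toggles in open-weight chat models are typically implemented in the prompt template, e.g. by opening a reasoning span for the assistant turn or pre-filling an empty one to suppress it. The sketch below illustrates that pattern only; the role tags and `<think>` token are hypothetical placeholders, not MiMo-V2-Flash's actual chat template, which should be taken from the official tokenizer configuration.

```python
def build_prompt(messages, thinking=True):
    """Illustrative chat-prompt builder with a thinking toggle.
    All special tokens here (<|user|>, <think>, ...) are placeholders,
    not the model's real template."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    parts.append("<|assistant|>\n")
    if thinking:
        parts.append("<think>\n")           # open a reasoning span for the model to fill
    else:
        parts.append("<think></think>\n")   # pre-filled empty span suppresses reasoning
    return "".join(parts)

msgs = [{"role": "user", "content": "Why is the sky blue?"}]
p_on = build_prompt(msgs)                   # extended chain-of-thought invited
p_off = build_prompt(msgs, thinking=False)  # direct answer, no reasoning trace
```

The practical trade-off is latency versus answer quality: enabling thinking spends extra tokens on the internal trace before the visible answer, which suits hard reasoning tasks but is wasteful for simple lookups.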