MiMo-V2.5 is a native omnimodal language model developed by Xiaomi that processes text, image, video, and audio inputs within a unified architecture. It is built on a sparse Mixture-of-Experts (MoE) framework with 310 billion total parameters, of which 15 billion are active per token. The model is optimized for agentic performance and long-context reasoning, and it improves on previous iterations in cross-modal perception and token efficiency.
The model's backbone uses a hybrid attention architecture that interleaves local Sliding Window Attention (SWA) and Global Attention (GA) layers at a 5:1 ratio. Because SWA layers cache only a fixed-size window of recent tokens, this configuration is designed to reduce KV-cache storage and memory overhead roughly sixfold while sustaining coherence across the 1,048,576-token (1M) context window. To speed up generation, the model integrates a three-layer Multi-Token Prediction (MTP) head, which enables speculative decoding at inference time.
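The roughly sixfold figure can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes five SWA layers for every GA layer, the same per-token KV size in both layer types, and a hypothetical window of 128 tokens (the window size is not stated here). It counts cached token positions per group of six layers and compares the result with an all-global baseline.

```python
def cached_positions(context_len: int, window: int,
                     swa_layers: int = 5, ga_layers: int = 1) -> int:
    """Cached token positions summed over one group of hybrid layers.

    An SWA layer keeps at most `window` positions; a GA layer keeps all of them.
    """
    swa = swa_layers * min(context_len, window)
    ga = ga_layers * context_len
    return swa + ga


def reduction_factor(context_len: int, window: int,
                     swa_layers: int = 5, ga_layers: int = 1) -> float:
    """KV-cache size of an all-global stack divided by the hybrid stack."""
    full = (swa_layers + ga_layers) * context_len
    hybrid = cached_positions(context_len, window, swa_layers, ga_layers)
    return full / hybrid


if __name__ == "__main__":
    # window=128 is an assumed value used only for illustration.
    factor = reduction_factor(context_len=1_048_576, window=128)
    print(f"approximate KV-cache reduction: {factor:.2f}x")
```

Under these assumptions the factor approaches 6 as the context grows, because the single GA layer in each group dominates the cache. For contexts shorter than the window there is no saving at all.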
Training for MiMo-V2.5 followed a five-stage pipeline that included large-scale pre-training on 48 trillion tokens and specialized post-training. Post-training combined Supervised Fine-Tuning (SFT), agentic reinforcement learning, and Multi-Teacher On-Policy Distillation (MOPD) to refine the model's ability to handle complex, long-horizon tasks and tool-use scenarios. Its multimodal capabilities are supported by dedicated vision and audio encoders, which allow the model to reason across different types of input data.
The model is released under the MIT License, which permits both open-source collaboration and commercial use. It supports a native "thinking" mode for reasoning-heavy queries, and the official recommendation for standard generation tasks is a temperature of 1.0 and a top_p of 0.95.
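To make the two sampling parameters concrete, here is a minimal, generic sketch of temperature scaling followed by top-p (nucleus) sampling over a list of logits. It is not the model's inference code. The function names `top_p_nucleus` and `sample_token` are hypothetical, and real serving stacks implement the same idea on tensors.

```python
import math
import random


def top_p_nucleus(probs, top_p=0.95):
    """Return the smallest set of token ids whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return nucleus


def sample_token(logits, temperature=1.0, top_p=0.95, rng=None):
    """Sample one token id using temperature scaling and nucleus filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    nucleus = top_p_nucleus(probs, top_p)
    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # toy values, not real model output
    print(sample_token(toy_logits, temperature=1.0, top_p=0.95))
```

A temperature of 1.0 leaves the model's distribution unchanged, and a top_p of 0.95 discards only the least likely tail of tokens before sampling.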