MiMo-V2.5-Pro is Xiaomi's flagship large language model, designed for complex autonomous agent tasks and long-horizon software engineering workflows. It is built on a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters, of which 42 billion are active on any given forward pass. As the successor to MiMo-V2-Pro, it is described as improving on its predecessor in instruction following, logical consistency, and token efficiency.
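To make the sparsity concrete: with the figures quoted above, roughly 4.2% of the weights take part in any one forward pass. The sketch below computes that fraction and shows a generic top-k expert router of the kind commonly used in MoE layers. The router, the function names, and the choice of k are illustrative assumptions; the source does not describe how MiMo-V2.5-Pro routes tokens.

```python
import math


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used in a single forward pass."""
    return active_params / total_params


def route_top_k(gate_logits: list[float], k: int) -> dict[int, float]:
    """Generic top-k MoE routing (hypothetical, not Xiaomi's router).

    Picks the k experts with the highest gate logits and renormalises
    their softmax weights so that they sum to 1.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i],
                 reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


# Figures from the text: 1 trillion total, 42 billion active.
fraction = active_fraction(1e12, 42e9)

# Toy example: four experts, two selected per token (both numbers are assumptions).
weights = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
```

Here `fraction` comes out at about 0.042, and `weights` assigns all of the token's routing weight to experts 1 and 3, the two with the highest gate logits. The remaining experts are skipped entirely, which is why the active parameter count is so much smaller than the total.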
The model uses a hybrid attention mechanism with a 7:1 ratio of sliding-window to global attention, which keeps inference fast even at its maximum context window of 1 million tokens. This design lets the model process very large inputs, such as entire code repositories or long-form video, while sustaining performance across trajectories of more than a thousand sequential tool calls.
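One plausible reading of the 7:1 ratio is that seven sliding-window layers are interleaved with each global-attention layer. The sketch below builds such a layer schedule and shows which key positions a query can see under each kind of layer. The interleaving pattern, the 16-layer depth, the window size of 128, and all names are assumptions made for illustration; the source gives only the ratio.

```python
def layer_schedule(num_layers: int, swa_per_global: int = 7) -> list[str]:
    """Label each layer 'swa' or 'global'.

    Assumes every (swa_per_global + 1)-th layer is global and the rest
    use sliding-window attention.
    """
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa"
            for i in range(num_layers)]


def visible_keys(query_pos: int, kind: str, window: int) -> range:
    """Causal key positions that the query at query_pos may attend to."""
    if kind == "global":
        start = 0  # full causal attention over the whole prefix
    else:
        start = max(0, query_pos - window + 1)  # only the last `window` tokens
    return range(start, query_pos + 1)


# Hypothetical 16-layer stack with a hypothetical window of 128 tokens.
schedule = layer_schedule(16)
swa_span = len(visible_keys(1000, "swa", window=128))
global_span = len(visible_keys(1000, "global", window=128))
```

Under these assumptions, a query at position 1,000 attends to 128 keys in a sliding-window layer and to 1,001 keys in a global layer. Because the per-token cost of a sliding-window layer is bounded by the window rather than by the sequence length, only the minority of global layers grow more expensive as the context approaches 1 million tokens.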
Capabilities and Multimodal Integration
MiMo-V2.5-Pro is natively multimodal: text, image, audio, and video processing are integrated into a single architecture. This lets it take on tasks such as end-to-end application development, professional-grade video editing via code execution, and reasoning that spans several input types. According to internal evaluations, the model performs strongly on benchmarks such as ClawEval and SWE-bench Pro, often rivaling or surpassing other frontier models in agentic execution.
Xiaomi emphasizes the model's token efficiency, reporting that it achieves benchmark scores comparable to those of other leading models while consuming significantly fewer tokens per trajectory. The model is positioned as the central "brain" of the Xiaomi "Human x Car x Home" ecosystem, providing a foundation for sophisticated automation and cross-device interaction.