MiMo-V2-Omni is a frontier omni-modal foundation model developed by Xiaomi, officially released in March 2026 as part of the MiMo-V2 family. Designed as a general-purpose multimodal model, it natively processes text, image, video, and audio inputs within a single unified architecture. This approach enables cross-modal reasoning and perception without the latency and information loss introduced by modular systems that chain separate per-modality encoders and decoders.
The model is optimized for agentic workflows, integrating high-fidelity perception with direct action. Supported capabilities include visual grounding, multi-step planning, tool invocation, and autonomous code execution. With a 262,144-token context window, MiMo-V2-Omni can analyze long-duration media, such as 10 hours of continuous audio, and perform complex reasoning across large volumes of multimodal input.
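Taken together, these two figures imply an approximate audio token rate. As a back-of-envelope check (assuming, since the specifications do not state this explicitly, that a 10-hour recording roughly fills the full 262,144-token window), the audio front end would consume about

$$\frac{262{,}144\ \text{tokens}}{10 \times 3600\ \text{s}} \approx 7.3\ \text{tokens per second of audio.}$$

In practice the usable rate would be somewhat lower, since prompts and generated output share the same window.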
During pre-release testing on developer platforms, the model operated under the codename Healer Alpha and gained recognition for topping several agentic benchmarks, including the PinchBench leaderboard. Following its official debut, Xiaomi reported that the model achieved leading scores in audio reasoning and video event forecasting, outperforming contemporary frontier models on benchmarks such as BigBench Audio and MMAU-Pro.
MiMo-V2-Omni is a central component of Xiaomi's "Agent Era" strategy, positioned alongside the trillion-parameter MiMo-V2-Pro reasoning model and the high-speed MiMo-V2-Flash. The Omni variant is primarily intended for real-world interaction scenarios, such as browser-based automation, embodied intelligence, and complex media content analysis.