
MiMo-V2-Omni-0327

Released Mar 2026

MiMo-V2-Omni-0327 is a multimodal foundation model developed by Xiaomi, released as a high-performance snapshot of the MiMo-V2 series in late March 2026. It is designed as a unified "omni" model, natively processing text, images, video, and audio within a shared backbone architecture. Unlike models that append separate encoders for different modalities, MiMo-V2-Omni-0327 treats perception as a continuous reasoning process, allowing it to interpret complex cross-modal relationships more effectively.

A standout capability of the model is its support for long-form audio, specifically the ability to process over 10 hours of continuous audio in a single request without chunking or intermediate summaries. This makes it suitable for analyzing entire podcast series, lengthy legal recordings, or full-day meeting recordings. The model is also optimized for agentic workflows, with native capabilities for visual grounding, multi-step planning, and UI element positioning, allowing it to interact with software environments and robotics frameworks.

In terms of performance, MiMo-V2-Omni-0327 achieves an Intelligence Index of approximately 44.9 and a Coding Index of 36.9 on standardized AI benchmarks. It has demonstrated strong results on visual reasoning tests like MMMU-Pro and video-based understanding tasks like FutureOmni, often outperforming larger competitors in specific multimodal contexts. The model also supports structured tool calling and function execution, enabling it to act as the core logic engine for autonomous digital assistants.
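To illustrate what structured tool calling looks like in practice, the sketch below builds a request payload in the OpenAI-compatible format that many hosted model APIs accept. The model identifier `mimo-v2-omni-0327`, the `get_weather` tool, and the endpoint convention are all illustrative assumptions, not details documented by Xiaomi.

```python
import json

# Hypothetical tool schema in the common OpenAI-compatible format.
# The tool and the model identifier below are placeholder assumptions
# for illustration, not documented by Xiaomi.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "mimo-v2-omni-0327",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "What's the weather in Beijing right now?"}
    ],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

# In a real integration this payload would be POSTed to a chat-completions
# endpoint; here we only serialize it to show the structure.
print(json.dumps(payload, indent=2))
```

The model's response would either contain a normal text message or a `tool_calls` entry naming the function and its JSON arguments, which the calling application executes and feeds back as a follow-up message.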

With a context window of 262,144 tokens, the model can maintain long-range coherence across vast amounts of input data. While Xiaomi provides an open-weights version of its smaller language model, MiMo-V2-Flash, the Omni variant is a proprietary model offered via API. It is positioned as a cost-effective alternative for developers seeking frontier-level multimodal capabilities with significantly lower inference costs compared to other flagship models.
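A 262,144-token window is 2^18 tokens (256 Ki). As a rough capacity check, the sketch below estimates whether a block of English text fits, using the common ~4 characters-per-token heuristic; both the heuristic and the helper function are illustrative assumptions, not the model's actual tokenizer.

```python
CONTEXT_WINDOW = 262_144  # 2**18 tokens, per the model's stated limit
CHARS_PER_TOKEN = 4       # rough English-text heuristic; an assumption,
                          # not the model's actual tokenizer ratio

def fits_in_context(text: str, reserve: int = 4_096) -> bool:
    """Estimate whether `text` fits, reserving `reserve` tokens for output."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN
    return estimated_tokens <= CONTEXT_WINDOW - reserve

assert CONTEXT_WINDOW == 2**18

# ~1 MB of plain English text (~250k estimated tokens) fits within the
# window after the output reserve; ~1.1 MB (~275k tokens) does not.
print(fits_in_context("a" * 1_000_000))  # True
print(fits_in_context("a" * 1_100_000))  # False
```

Under these assumptions, a single request could hold on the order of a megabyte of plain text, which is what allows the long-range coherence the paragraph above describes.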

Rankings & Comparison