MiniCPM-V 4.6 1.3B is a compact multimodal vision-language model (VLM) developed by OpenBMB in collaboration with Tsinghua University. Released in May 2026, it is designed for edge-device deployment, with a focus on efficient image and video understanding. The architecture pairs the SigLIP2-400M vision encoder with the Qwen3.5-0.8B language model, for a total of 1.3 billion dense parameters.
The model introduces mixed 4x/16x visual token compression and uses the LLaVA-UHD v4 framework for visual encoding. This approach combines an early-exit strategy for visual processing with tiled processing of high-resolution inputs, which reduces computational requirements by over 50% while retaining the ability to process detailed documents, UI screenshots, and medical images. These optimizations give the model high token throughput; it reportedly outperforms larger models such as Gemma4-E2B-it and matches the capabilities of many 2B-scale models.
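To make the effect of the 4x and 16x compression ratios concrete, the following is a minimal sketch of the token arithmetic for a tiled image. The tile size (448 pixels), the patch size (14 pixels), and the function name `visual_tokens` are illustrative assumptions, not the model's documented configuration; only the two compression ratios come from the description above.

```python
import math


def visual_tokens(width, height, patch=14, tile=448, ratio=16):
    """Estimate the visual token count for an image split into square tiles.

    patch and tile are hypothetical values chosen for illustration;
    ratio is the token compression factor (4 or 16 in the text above).
    """
    # Number of tiles needed to cover the image in each dimension.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    # Each tile is cut into a grid of patches, one raw token per patch.
    patches_per_tile = (tile // patch) ** 2
    # Compression merges `ratio` raw tokens into one.
    tokens_per_tile = patches_per_tile // ratio
    return tiles * tokens_per_tile


# A 1344x896 image covers 3 x 2 = 6 tiles of 1024 patches each.
print(visual_tokens(1344, 896, ratio=1))   # 6144 (no compression)
print(visual_tokens(1344, 896, ratio=4))   # 1536
print(visual_tokens(1344, 896, ratio=16))  # 384
```

Under these assumptions, moving from 4x to 16x compression cuts the number of visual tokens passed to the language model by a further factor of four, which is where most of the throughput gain in a sketch like this would come from.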
MiniCPM-V 4.6 supports a wide range of multimodal tasks, including single- and multi-image understanding, video summarization, OCR, and multi-turn dialogue. It is optimized for on-device performance on iOS, Android, and HarmonyOS. A reasoning-focused variant, MiniCPM-V 4.6 Thinking, generates explicit chain-of-thought traces to improve accuracy on complex mathematical and logical reasoning tasks. The model supports a context window of 262,144 tokens (256K), enabling long-form text processing alongside visual data.
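As a rough illustration of what a 262,144-token window means for visual input, the sketch below estimates how many images or video frames fit once some tokens are set aside for text. The per-frame token counts (64 and 256) and the 8,192-token text reservation are hypothetical values chosen for the example, and `max_frames` is not part of any MiniCPM-V API.

```python
CONTEXT_WINDOW = 262_144  # tokens, as stated above (2**18)


def max_frames(tokens_per_frame, reserved_text_tokens=8_192,
               context_window=CONTEXT_WINDOW):
    """Return how many frames fit in the context after reserving text tokens.

    tokens_per_frame and reserved_text_tokens are illustrative assumptions.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    # Tokens left for visual input once the text budget is set aside.
    available = max(context_window - reserved_text_tokens, 0)
    return available // tokens_per_frame


print(max_frames(64))   # 3968 frames at 64 tokens each
print(max_frames(256))  # 992 frames at 256 tokens each
```

The estimate ignores special tokens and any per-image separators, so real capacity would be somewhat lower.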