GLM-4.6V is a multimodal foundation model developed by Z.ai (formerly Zhipu AI) and released as part of the GLM-4.6 family in late 2025. It uses a Mixture-of-Experts (MoE) architecture with approximately 106 billion total parameters and is optimized for visual reasoning and multimodal understanding. Unlike vision models whose tool-calling interfaces are limited to text descriptions, GLM-4.6V supports native multimodal function calling, processing images and document screenshots directly as tool inputs.
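As a rough illustration of what multimodal function calling looks like from the caller's side, here is a minimal sketch assuming an OpenAI-compatible chat-completions endpoint. The base URL, the model identifier "glm-4.6v", and the record_invoice tool are illustrative assumptions, not confirmed values from Z.ai's documentation; the point is that the image itself travels in the same message that triggers the tool call.

```python
# Minimal sketch of multimodal function calling via an OpenAI-compatible
# client. Endpoint URL, model name, and tool schema are assumptions.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.z.ai/api/paas/v4",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

# Hypothetical tool: stores line items extracted from an invoice screenshot.
tools = [{
    "type": "function",
    "function": {
        "name": "record_invoice",
        "description": "Store line items extracted from an invoice image.",
        "parameters": {
            "type": "object",
            "properties": {
                "items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "amount": {"type": "number"},
                        },
                        "required": ["description", "amount"],
                    },
                }
            },
            "required": ["items"],
        },
    },
}]

# The screenshot is passed directly as message content, not as extracted text.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Extract all line items and record them."},
        ],
    }],
    tools=tools,
)

# If the model chose to call the tool, its structured arguments appear here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```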
The model accepts text, image, and video inputs within a context window of 128,000 tokens. Its key capabilities include high-resolution OCR; complex document layout interpretation, reading charts, tables, and figures without a prior text-conversion step; and frontend replication, reconstructing frontend code whose rendered output closely matches a UI screenshot (see the sketch below). While related to Z.ai's reasoning-centric "Thinking" models, the standard GLM-4.6V focuses on efficient, direct multimodal inference for agentic workflows and real-world business applications.
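A frontend-replication request follows the same pattern: the screenshot goes in as image content and the reply carries the generated markup. Again a hedged sketch under the same assumptions (OpenAI-compatible endpoint, assumed "glm-4.6v" model name); the file name and prompt wording are purely illustrative.

```python
# Sketch of frontend replication: send a UI screenshot, ask for matching
# HTML/CSS. Endpoint and model name are assumptions, as above.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.z.ai/api/paas/v4",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

with open("dashboard_mockup.png", "rb") as f:  # hypothetical screenshot
    ui_b64 = base64.b64encode(f.read()).decode()

reply = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{ui_b64}"}},
            {"type": "text",
             "text": "Reproduce this page as a single self-contained HTML "
                     "file with inline CSS, matching the layout and styling "
                     "as closely as possible."},
        ],
    }],
)

print(reply.choices[0].message.content)  # the generated HTML/CSS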