GLM-4.6V is a multimodal large language model series developed by Z.ai (Zhipu AI), engineered for high-fidelity visual understanding and long-context reasoning. Released as an iteration of the GLM-V family, the model unifies vision, text, and tool-calling capabilities to perform complex analytical tasks across images, video, multi-page documents, and technical charts.
The model uses a 106B-parameter architecture (with a 9B "Flash" variant) and offers a 128,000-token context window. This scale allows it to process extensive multimodal inputs, such as hour-long video segments or 150-page financial reports, in a single pass while maintaining context awareness across mixed media. The models are released with open weights, supporting self-hosted deployment for agentic and research workflows.
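As a minimal sketch of what a self-hosted, long-context query might look like, the snippet below sends every page of a scanned report to the model in a single request. It assumes the open weights are served behind an OpenAI-compatible endpoint (for example via vLLM); the base URL, model identifier, and file paths are placeholders, not part of the official release.

```python
# Sketch: query a self-hosted GLM-4.6V deployment over a multi-page document.
# Assumes an OpenAI-compatible server (e.g., vLLM) at a placeholder URL.
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def encode_page(path: Path) -> dict:
    """Package one scanned report page as an OpenAI-style image part."""
    b64 = base64.b64encode(path.read_bytes()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

# Every page goes into one request; the 128K-token window is what lets the
# model keep cross-page context in a single pass.
pages = sorted(Path("report_pages").glob("page_*.png"))  # placeholder paths
content = [encode_page(p) for p in pages]
content.append({"type": "text",
                "text": "Summarize the revenue trends across all pages "
                        "and cite the page each figure appears on."})

response = client.chat.completions.create(
    model="glm-4.6v",  # placeholder model identifier
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```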
Key capabilities of GLM-4.6V include native multimodal function calling, which lets the system invoke external tools directly from visual inputs, bridging perception and executable action. It excels at tasks such as UI-to-code generation, visual document QA, and spatial grounding, reliably identifying object locations and visual references. These features enable the automation of complex workflows, including visual web search and the synthesis of structured image-text content from raw document data, as in the sketch below.
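The following sketch illustrates the multimodal function-calling pattern: the model receives an image alongside a tool schema and may answer with a structured tool call instead of prose. The endpoint, model identifier, image URL, and the `search_product` tool are illustrative assumptions, not documented GLM-4.6V APIs; the request shape follows the generic OpenAI-compatible tools format.

```python
# Sketch: tool calling triggered by a visual input, via an assumed
# OpenAI-compatible endpoint. Names below are hypothetical placeholders.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_product",  # hypothetical tool for illustration
        "description": "Look up a product seen in an image by name.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # placeholder model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/storefront.png"}},
            {"type": "text",
             "text": "Find the price of the red sneakers in this photo."},
        ],
    }],
    tools=tools,
)

# If the model grounded the request in the image and chose to act, the
# reply carries a structured tool call rather than free-form text.
msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(msg.content)
```

In this pattern, the caller executes the returned tool call and feeds the result back as a follow-up message, so perception (reading the image) flows directly into action (running the tool).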