GLM-4.6V is Z.ai's full-scale 106B-parameter vision-language model, designed for cloud and high-performance cluster deployment. Released September 30, 2025, it upgrades GLM-4.5V with a 128K-token context window, native multimodal function calling, and reported state-of-the-art multimodal benchmark results at comparable parameter scales.
A defining capability is native multimodal function calling, which lets you pass images and screenshots directly as tool inputs without converting them to text descriptions first. This enables agentic workflows where the model reasons about visual content, decides to call tools based on what it sees, and processes visual results across multiple steps. Combined with interleaved image-text content generation, GLM-4.6V produces mixed-media outputs from complex inputs.
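To make the flow concrete, here is a minimal sketch of a multimodal function call, assuming an OpenAI-compatible chat completions endpoint. The base URL, model identifier, and the `file_bug_report` tool are assumptions for illustration, not confirmed API details; consult Z.ai's documentation for the actual values.

```python
# Minimal sketch, assuming an OpenAI-compatible endpoint.
# Base URL, model id, and tool name below are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.z.ai/api/paas/v4",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

# A tool the model may choose to call after inspecting the screenshot.
tools = [{
    "type": "function",
    "function": {
        "name": "file_bug_report",  # hypothetical tool
        "description": "File a bug for a UI defect visible in a screenshot.",
        "parameters": {
            "type": "object",
            "properties": {
                "component": {"type": "string"},
                "summary": {"type": "string"},
            },
            "required": ["component", "summary"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            # The screenshot is passed directly; no text transcription step.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/screenshot.png"}},
            {"type": "text",
             "text": "If this page has a visual defect, file a bug for it."},
        ],
    }],
    tools=tools,
)

# If the model decided to call the tool, the arguments arrive as JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

In an agentic loop, the tool result (which may itself be an image) is appended to the conversation and the model is called again, which is what allows it to process visual results across multiple steps.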
GLM-4.6V handles frontend replication and visual editing: given a screenshot, it can reconstruct pixel-accurate HTML/CSS and apply iterative modifications. This capability, combined with multimodal document understanding (joint interpretation of text, layout, charts, and figures), suits UI development workflows and complex document processing pipelines.
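The screenshot-to-HTML loop follows the same pattern. The sketch below shows one plausible shape of it, under the same assumptions as above (OpenAI-compatible endpoint, assumed base URL and model id); the prompts are illustrative, not prescribed.

```python
# Sketch of frontend replication plus one iterative edit.
# Endpoint and model id are assumptions, as before.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.z.ai/api/paas/v4",  # assumed
                api_key="YOUR_API_KEY")

# Embed a local screenshot as a data URL.
with open("page.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": data_url}},
        {"type": "text",
         "text": "Reproduce this page as a single self-contained HTML file "
                 "with inline CSS, matching layout and colors as closely "
                 "as possible."},
    ],
}]

first_pass = client.chat.completions.create(model="glm-4.6v",
                                            messages=messages)
html = first_pass.choices[0].message.content

# Iterative visual editing: feed the draft back with a follow-up request.
messages += [
    {"role": "assistant", "content": html},
    {"role": "user",
     "content": "Make the header sticky and darken the footer."},
]
revised = client.chat.completions.create(model="glm-4.6v",
                                         messages=messages)
print(revised.choices[0].message.content)
```

Because the original screenshot stays in the conversation history, each revision request is grounded against the visual target rather than only the previous draft.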