GLM-4.5V extends the GLM-4.5-Air foundation with multimodal vision capabilities. Built by Z.ai, it targets image reasoning, document understanding, and visual grounding tasks at a scale comparable to other open vision-language models in its class.
The model supports a broad range of visual input types: single images, multi-image analysis, long video understanding with event recognition, complex chart and document parsing, and GUI task handling including screen reading and icon recognition. A distinctive feature is visual grounding, where the model localizes specific elements in images with bounding box coordinates, enabling applications that need to point at or interact with visual content programmatically.
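To make the grounding output concrete, here is a minimal sketch of extracting bounding boxes from a model reply and mapping them to pixel coordinates. The reply format shown (`[[x1, y1, x2, y2]]` in 0–1000 normalized coordinates) is an assumption for this example, not a documented contract; check the model card for the actual output convention.

```python
import re

def parse_boxes(reply: str) -> list[tuple[int, int, int, int]]:
    """Extract [[x1,y1,x2,y2]] style boxes from a reply (hypothetical format)."""
    pattern = r"\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]"
    return [tuple(int(g) for g in m.groups()) for m in re.finditer(pattern, reply)]

def to_pixels(box, width, height, scale=1000):
    """Map normalized 0..scale coordinates onto an image of the given size."""
    x1, y1, x2, y2 = box
    return (x1 * width // scale, y1 * height // scale,
            x2 * width // scale, y2 * height // scale)

reply = "The settings icon is at [[120, 340, 180, 400]]."
boxes = parse_boxes(reply)
print(boxes)                            # [(120, 340, 180, 400)]
print(to_pixels(boxes[0], 1920, 1080))  # (230, 367, 345, 432)
```

A pipeline like this is what lets a GUI agent click the element the model has localized.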
GLM-4.5V includes a thinking mode switch that trades response speed for reasoning depth. For straightforward visual questions, disable thinking to get fast answers; for complex multi-image analysis or document interpretation, enable it to improve accuracy. The model has a 66K-token context window.
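The thinking switch can be sketched as a field in an OpenAI-compatible chat payload. The `"thinking"` field name and its `{"type": "enabled"/"disabled"}` values are assumptions based on common GLM API conventions, not confirmed by this document; verify the exact parameter against the Z.ai API reference.

```python
# Build a chat request that toggles thinking mode (payload only, no network call).
# Field names below are illustrative assumptions -- confirm with the Z.ai docs.

def build_request(question: str, image_url: str, deep_reasoning: bool) -> dict:
    return {
        "model": "glm-4.5v",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        # Enable thinking for complex multi-image or document tasks;
        # disable it for quick single-image questions.
        "thinking": {"type": "enabled" if deep_reasoning else "disabled"},
    }

quick = build_request("What color is the sign?", "https://example.com/sign.jpg", False)
deep = build_request("Compare revenue across these charts.", "https://example.com/charts.jpg", True)
print(quick["thinking"])  # {'type': 'disabled'}
print(deep["thinking"])   # {'type': 'enabled'}
```

In practice the payload would be sent through any OpenAI-compatible client pointed at the serving endpoint.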