GLM 4.5V

GLM 4.5V is Z.ai's vision-language model built on GLM-4.5-Air. It supports image reasoning, long video understanding, GUI task handling, and visual grounding.

Reasoning, Tool Use, Vision (Image)
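index.ts
import { streamText } from 'ai'

// Stream a response from GLM 4.5V through AI Gateway
const result = streamText({
  model: 'zai/glm-4.5v',
  prompt: 'Why is the sky blue?',
})

// Print each text chunk as it arrives
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}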

Playground

Try out GLM 4.5V by Z.ai. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About GLM 4.5V

GLM 4.5V extends the GLM-4.5-Air foundation with multimodal vision capabilities. Built by Z.ai, it targets image reasoning, document understanding, and visual grounding tasks at a comparable scale to other models in its class.

The model supports a broad range of visual input types: single images, multi-image analysis, long video understanding with event recognition, complex chart and document parsing, and GUI task handling including screen reading and icon recognition. A distinctive feature is visual grounding, where the model localizes specific elements in images with bounding box coordinates, enabling applications that need to point at or interact with visual content programmatically.
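As a rough sketch of a multi-image request, the helper below builds a single user message that pairs a text question with several images. The part shapes (`type: 'text'`, `type: 'image'`) are written to mirror the AI SDK's multimodal message format, but the types are declared locally here, and `buildImageQuestion` and the example URLs are hypothetical names used only for illustration.

```typescript
// Local types that mirror the AI SDK's multimodal user message shape (assumption).
type TextPart = { type: 'text'; text: string }
type ImagePart = { type: 'image'; image: string } // http(s) URL or base64 string
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> }

// Hypothetical helper: one text question followed by one image part per URL.
function buildImageQuestion(question: string, imageUrls: string[]): UserMessage {
  const textPart: TextPart = { type: 'text', text: question }
  const imageParts: ImagePart[] = imageUrls.map((url) => ({ type: 'image', image: url }))
  return { role: 'user', content: [textPart, ...imageParts] }
}

// Placeholder URLs, not real assets.
const message = buildImageQuestion('What changed between these two screenshots?', [
  'https://example.com/before.png',
  'https://example.com/after.png',
])

console.log(message.content.length)
```

A message built this way would typically be passed as `messages: [message]` to `streamText` in place of `prompt`.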

GLM 4.5V includes a thinking mode switch that balances quick responses against deeper reasoning. For straightforward visual questions, disable thinking for fast responses. For complex multi-image analysis or document interpretation, enable thinking to improve accuracy. The model operates within a context window of 66K tokens.
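One way to apply this guidance consistently is to decide the thinking setting from the kind of task. The sketch below is only an illustration: the task names and the `shouldEnableThinking` helper are hypothetical, and how the resulting flag is passed to the model depends on the provider options, which are not shown here.

```typescript
// Hypothetical task categories, taken from the examples in the text.
type VisualTask =
  | 'captioning'
  | 'classification'
  | 'chart-analysis'
  | 'multi-image-comparison'
  | 'document-interpretation'

// true = enable thinking (favor accuracy), false = disable it (favor latency).
const THINKING_BY_TASK: Record<VisualTask, boolean> = {
  captioning: false,
  classification: false,
  'chart-analysis': true,
  'multi-image-comparison': true,
  'document-interpretation': true,
}

function shouldEnableThinking(task: VisualTask): boolean {
  return THINKING_BY_TASK[task]
}

console.log(shouldEnableThinking('chart-analysis'))
```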

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Context | Latency | Throughput | Input | Output | Cache Read | Release Date | Legal |
|---|---|---|---|---|---|---|---|---|
| Novita AI | 66K | 0.9s | 61 tps | $0.60/M | $1.80/M | $0.11/M | 08/11/2025 | Terms, Privacy |
| Z.ai | 66K | 0.7s | 64 tps | $0.60/M | $1.80/M | $0.11/M | 08/11/2025 | Terms, Privacy |

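To make the listed per-million-token rates concrete, here is a minimal cost sketch. The rates ($0.60/M input, $1.80/M output, $0.11/M cache read) come from the provider listing; the `estimateCostUsd` helper is hypothetical, and the assumption that cached input tokens are billed at the cache-read rate instead of the input rate should be checked against your actual invoice.

```typescript
// Per-million-token rates in USD, copied from the provider listing.
interface Rates {
  inputPerM: number
  outputPerM: number
  cacheReadPerM: number
}

const GLM_45V_RATES: Rates = { inputPerM: 0.6, outputPerM: 1.8, cacheReadPerM: 0.11 }

// Hypothetical helper: estimated cost of one request in USD.
// Assumption: cachedInputTokens is the part of inputTokens billed at the cache-read rate.
function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  cachedInputTokens: number = 0,
  rates: Rates = GLM_45V_RATES,
): number {
  const freshInputTokens = Math.max(inputTokens - cachedInputTokens, 0)
  const totalPerMillion =
    freshInputTokens * rates.inputPerM +
    cachedInputTokens * rates.cacheReadPerM +
    outputTokens * rates.outputPerM
  return totalPerMillion / 1_000_000
}

// 10,000 input tokens and 1,000 output tokens with no cache hits: about $0.0078
const cost = estimateCostUsd(10_000, 1_000)
console.log(cost.toFixed(4))
```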
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Z.ai

| Model | Context | Latency | Throughput | Input | Output | Cache Read | Providers | Release Date |
|---|---|---|---|---|---|---|---|---|
| | 205K | 0.7s | 46 tps | $1.40/M | $4.40/M | $0.26/M | deepinfra, fireworks, novita, +1 | 04/07/2026 |
| | 200K | 0.9s | 158 tps | $1.20/M | $4.00/M | $0.24/M | zai | 04/01/2026 |
| | 203K | 1.0s | 76 tps | $1.20/M | $4.00/M | $0.24/M | zai | 03/15/2026 |
| | 203K | 0.2s | 70 tps | $0.80/M | $2.56/M | $0.16/M | bedrock, deepinfra, fireworks, +3 | 02/12/2026 |
| | 205K | 0.1s | 551 tps | $2.25/M | $2.75/M | $2.25/M | bedrock, cerebras, deepinfra, +2 | 12/22/2025 |
| | 200K | 0.2s | 116 tps | $0.07/M | $0.40/M | $0.01/M | bedrock, zai | |

What To Consider When Choosing a Provider

  • Configuration: High-resolution images consume more input tokens. Consider resizing images to the minimum resolution your task requires to control costs.
  • Configuration: Enable thinking for complex visual reasoning tasks (chart analysis, multi-image comparison). Disable it for simple captioning or classification to reduce latency.
  • Configuration: When using visual grounding, the model returns bounding box coordinates normalized by image dimensions. Your application needs to handle this coordinate format for downstream processing.
  • Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
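For the image-resolution point above, a small sketch of how target dimensions might be computed before resizing. The `fitWithin` helper and the 1024-pixel limit are hypothetical choices for illustration; the actual resize would be done with an image library of your choice, which is not shown.

```typescript
interface Size {
  width: number
  height: number
}

// Hypothetical helper: scale down so the longer side is at most maxSide,
// preserving aspect ratio. Images already within the limit are left unchanged.
function fitWithin(size: Size, maxSide: number): Size {
  const longest = Math.max(size.width, size.height)
  if (longest <= maxSide) {
    return { width: size.width, height: size.height }
  }
  const scale = maxSide / longest
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  }
}

// A 4032x3024 photo limited to 1024 pixels on its longer side becomes 1024x768.
const target = fitWithin({ width: 4032, height: 3024 }, 1024)
console.log(`${target.width}x${target.height}`)
```

Choosing the limit is task-dependent: dense documents or small UI text may need a larger value than simple captioning.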

When to Use GLM 4.5V

Best For

  • Document and chart understanding: Parsing complex layouts with mixed text, tables, and figures requires joint visual-textual reasoning
  • GUI automation and testing: Screen reading, icon recognition, and visual element localization in one model
  • Multi-image analysis: Multiple images or image sequences processed in a single request
  • Long video understanding: Event recognition and temporal reasoning across extended video content
  • Visual grounding tasks: Bounding box coordinates for detected elements are returned natively

Consider Alternatives When

  • Text-only workloads: GLM-4.5 or GLM-4.5-Air provides the same language foundation without the vision overhead
  • Image generation needed: GLM 4.5V is input-multimodal only and produces text output
  • Advanced vision features: GLM-4.6V offers an upgraded 128K context window and native multimodal function calling
  • Pixel-accurate frontend replication: GLM-4.6V includes targeted improvements for HTML/CSS reconstruction from screenshots

Conclusion

GLM 4.5V brings vision-language capability to the GLM-4.5 generation, with a focus on document understanding, GUI interaction, and visual grounding. The thinking mode switch gives you control over the accuracy-latency tradeoff on a per-request basis.

Frequently Asked Questions

  • What visual inputs does GLM 4.5V support?

    Single images, multiple images, long videos, screenshots, charts, documents, and GUI interfaces. It processes these alongside text prompts in a single request.

  • What is visual grounding in GLM 4.5V?

    Visual grounding lets the model identify and localize specific elements in images by returning bounding box coordinates. Coordinates are normalized by image dimensions, enabling programmatic interaction with detected visual elements.

  • Does GLM 4.5V support video input?

    Yes. It handles long video understanding with event recognition and temporal reasoning, processing extended video content within the context window.

  • How does the thinking mode work?

    You can toggle thinking on or off per request. Thinking mode enables deeper chain-of-thought reasoning for complex visual tasks. Disabling it provides faster, more direct responses for simpler queries.

  • How do I authenticate with GLM 4.5V through AI Gateway?

    AI Gateway authenticates requests with a single API key or an OIDC token, so no separate Z.ai account is needed. Configure your credentials and use the model identifier zai/glm-4.5v to route requests. Bring your own key (BYOK) is also supported if you have a direct provider account.

  • How does GLM 4.5V compare to GLM-4.6V?

    GLM 4.5V builds on GLM-4.5-Air and targets vision-language tasks at its scale. GLM-4.6V is the next generation with a 128K context window, native multimodal function calling, and improved frontend replication capabilities.

  • Can GLM 4.5V generate images?

    No. GLM 4.5V accepts visual inputs and produces text output only. For image generation, use a dedicated image generation model.
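Returning to the visual grounding answer above: because coordinates are normalized by image dimensions, downstream code has to map them back to pixels. The sketch below assumes boxes arrive as `[x1, y1, x2, y2]` and that the normalization range is configurable (`normalizedMax`, defaulting to 1). Both the box layout and the range are assumptions, so check them against the model's actual output; `toPixelBox` is a hypothetical helper.

```typescript
// Assumed layout: [x1, y1, x2, y2], top-left and bottom-right corners.
type Box = [number, number, number, number]

// Hypothetical helper: convert normalized coordinates to pixel coordinates.
// normalizedMax is the upper end of the normalized range (1 for 0..1, 1000 for 0..1000).
function toPixelBox(
  box: Box,
  imageWidth: number,
  imageHeight: number,
  normalizedMax: number = 1,
): Box {
  // Clamp to the 0..1 range so slightly out-of-range values stay inside the image.
  const toUnit = (value: number): number => Math.min(Math.max(value / normalizedMax, 0), 1)
  const [x1, y1, x2, y2] = box
  return [
    Math.round(toUnit(x1) * imageWidth),
    Math.round(toUnit(y1) * imageHeight),
    Math.round(toUnit(x2) * imageWidth),
    Math.round(toUnit(y2) * imageHeight),
  ]
}

// On a 1920x1080 image, [0.25, 0.5, 0.75, 1] maps to [480, 540, 1440, 1080].
const pixelBox = toPixelBox([0.25, 0.5, 0.75, 1], 1920, 1080)
console.log(pixelBox.join(', '))
```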