
GLM-4.6V-Flash

GLM-4.6V-Flash is Z.ai's lightweight 9B parameter vision-language model for low-latency applications. It shares GLM-4.6V's multimodal capabilities at a fraction of the compute cost.

Vision (Image), Reasoning, File Input, Tool Use, Implicit Caching
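index.ts
import { streamText } from 'ai'
const result = streamText({
  model: 'zai/glm-4.6v-flash',
  prompt: 'Why is the sky blue?'
})
// Print the response text as it streams in
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}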

Playground

Try out GLM-4.6V-Flash by Z.ai. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider: Z.ai (Legal: Terms, Privacy)
Context: 128K
Release Date: 09/30/2025
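
The provider preference described in this section can be set in code. The sketch below assumes the AI SDK reads the preference from `providerOptions.gateway.order` and that `zai` is the provider slug for Z.ai. The helper name `withProviderOrder` is hypothetical, so check the AI Gateway docs for the exact option names before relying on it.

```typescript
// Hypothetical helper: builds the providerOptions object that asks
// AI Gateway to try providers in the given order.
// Assumption: the gateway reads `providerOptions.gateway.order`.
type GatewayProviderOptions = {
  gateway: { order: string[] };
};

function withProviderOrder(slugs: string[]): GatewayProviderOptions {
  if (slugs.length === 0) {
    throw new Error('Provide at least one provider slug');
  }
  // Copy the array so later changes by the caller do not leak in.
  return { gateway: { order: [...slugs] } };
}

// 'zai' is assumed to be the slug copied from the Providers table.
const providerOptions = withProviderOrder(['zai']);
console.log(JSON.stringify(providerOptions)); // {"gateway":{"order":["zai"]}}
```

The resulting object would be passed as the `providerOptions` argument alongside `model` and `prompt` in a `streamText` call.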
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Z.ai

Context  Latency  Throughput  Input    Output   Cache Read  Providers                          Release Date
205K     1.1s     55tps       $1.40/M  $4.40/M  $0.26/M     deepinfra, fireworks, novita, +1   04/07/2026
203K     0.9s     86tps       $1.20/M  $4.00/M  $0.24/M     zai                                03/15/2026
203K     0.4s     65tps       $0.80/M  $2.56/M  $0.16/M     bedrock, deepinfra, fireworks, +3  02/12/2026
205K     0.1s     666tps      $2.25/M  $2.75/M  $2.25/M     bedrock, cerebras, deepinfra, +2   12/22/2025
205K     0.4s     208tps      $0.60/M  $2.20/M  $0.11/M     baseten, deepinfra, novita, +1     09/30/2025
200K     0.1s     149tps      $0.07/M  $0.40/M  $0.01/M     bedrock, zai                       -

About GLM-4.6V-Flash

GLM-4.6V-Flash is the 9B parameter efficiency variant in Z.ai's GLM-4.6V family, released September 30, 2025. Where GLM-4.6V targets maximum capability at 106B parameters, GLM-4.6V-Flash delivers vision-language understanding at a scale suitable for latency-sensitive production workloads.

Despite its compact size, GLM-4.6V-Flash retains the core multimodal capabilities of the GLM-4.6V generation: a 128K-token context window, multimodal document understanding, and visual reasoning. The reduced parameter count means faster inference and lower per-token cost, which makes high-volume visual processing pipelines economically viable.
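
To make the multimodal input concrete, here is a minimal sketch of a user message that pairs a question with images. The part shapes (`type: 'text'`, `type: 'image'`) are modeled on the AI SDK's message format and are an assumption here, as is the hypothetical helper `buildVisionMessage`. The URL is a placeholder.

```typescript
// Shapes modeled on the AI SDK's user message content parts (an assumption).
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Hypothetical helper: pairs one question with one or more image URLs.
function buildVisionMessage(question: string, imageUrls: string[]): UserMessage {
  if (imageUrls.length === 0) {
    throw new Error('Provide at least one image URL');
  }
  const images = imageUrls.map(
    (url): ImagePart => ({ type: 'image', image: url }),
  );
  // The text part comes first, followed by every image part.
  return { role: 'user', content: [{ type: 'text', text: question }, ...images] };
}

const message = buildVisionMessage('What does this chart show?', [
  'https://example.com/chart.png', // placeholder URL
]);
console.log(message.content.length); // 2
```

A message built this way would go in the `messages` array of a `streamText` call in place of the plain `prompt` string.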

Route traffic through AI Gateway for managed access, unified billing, and built-in observability across providers.

What To Consider When Choosing a Provider

  • Configuration: GLM-4.6V-Flash is optimized for speed and efficiency. For tasks requiring the deepest visual reasoning or the most accurate frontend replication, GLM-4.6V (106B) may produce better results.
  • Configuration: GLM-4.6V-Flash has the same 128K-token context window as the full GLM-4.6V, so long-document and multi-image workflows don't require compromises on input size.
  • Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
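
The authentication choice above can be sketched as a small helper that prefers an API key and falls back to an OIDC token. The environment variable names `AI_GATEWAY_API_KEY` and `VERCEL_OIDC_TOKEN`, the Bearer scheme, and the helper name `gatewayAuthHeader` are assumptions for illustration. The AI SDK normally resolves credentials for you, so this only matters when calling the gateway directly.

```typescript
// Assumptions: the variable names AI_GATEWAY_API_KEY and VERCEL_OIDC_TOKEN,
// and that the token is sent as a Bearer credential.
type Env = Record<string, string | undefined>;

// Hypothetical helper: picks the API key if set, otherwise the OIDC token.
function gatewayAuthHeader(env: Env): { Authorization: string } {
  const token = env.AI_GATEWAY_API_KEY || env.VERCEL_OIDC_TOKEN;
  if (!token) {
    throw new Error('Set AI_GATEWAY_API_KEY or VERCEL_OIDC_TOKEN');
  }
  return { Authorization: `Bearer ${token}` };
}

// 'demo-key' is a placeholder, not a real credential.
const headers = gatewayAuthHeader({ AI_GATEWAY_API_KEY: 'demo-key' });
console.log(headers.Authorization); // Bearer demo-key
```

In a Node.js program, `process.env` would be passed as the `env` argument.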

When to Use GLM-4.6V-Flash

Best For

  • High-volume visual processing: Pipelines where per-request cost and latency determine feasibility
  • Low-latency multimodal applications: Real-time user interactions that require visual understanding
  • Document scanning and extraction pipelines: Large volumes of images and documents that must be processed quickly
  • Visual classification and captioning: Workloads at scale where throughput matters more than peak reasoning depth
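
Whether a high-volume pipeline is feasible usually comes down to simple arithmetic on per-token rates. The sketch below shows that calculation. The function name `estimateCostUsd` is hypothetical, and the rates in the example are placeholders chosen for easy arithmetic, not actual GLM-4.6V-Flash prices. Substitute the rates from the pricing section of this page.

```typescript
// Rates are expressed in USD per one million tokens.
type Rates = { inputPerMillion: number; outputPerMillion: number };

// Hypothetical helper: cost = input share + output share.
function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  rates: Rates,
): number {
  const inputCost = (inputTokens / 1_000_000) * rates.inputPerMillion;
  const outputCost = (outputTokens / 1_000_000) * rates.outputPerMillion;
  return inputCost + outputCost;
}

// Placeholder rates for illustration only, not real prices.
const placeholderRates: Rates = { inputPerMillion: 1, outputPerMillion: 2 };
console.log(estimateCostUsd(1_000_000, 500_000, placeholderRates)); // 2
```

Multiplying the per-request result by the expected request volume gives a rough budget for the pipeline.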

Consider Alternatives When

  • Maximum visual reasoning: GLM-4.6V (106B) provides the full-scale model for the most demanding visual tasks
  • Pixel-accurate frontend replication: The full GLM-4.6V model is better suited for HTML/CSS reconstruction from screenshots
  • Text-only capabilities: GLM-4.6 provides the coding-focused model without vision processing overhead
  • Next-generation features: Evaluate GLM-4.7 and GLM-5 for capabilities beyond this generation

Conclusion

GLM-4.6V-Flash makes vision-language capability accessible at a scale that fits latency-sensitive and cost-conscious deployments. For teams processing visual content at volume, it provides the core multimodal capabilities of the GLM-4.6V generation in a practical 9B parameter package.

Frequently Asked Questions

  • How does GLM-4.6V-Flash compare to the full GLM-4.6V?

    GLM-4.6V-Flash is a 9B parameter model optimized for speed and lower per-token cost. GLM-4.6V is the full 106B parameter model for maximum visual reasoning capability. Both share a context window of 128K tokens.

  • Does GLM-4.6V-Flash support the same inputs as GLM-4.6V?

    Yes. It processes images, documents, charts, and text within a context window of 128K tokens, though peak performance on the most complex visual reasoning tasks will be lower than the full 106B model.

  • What is the context window for GLM-4.6V-Flash?

    128K tokens, matching the full GLM-4.6V model.

  • How do I authenticate with GLM-4.6V-Flash through AI Gateway?

    AI Gateway authenticates requests with a single API key or OIDC token, so no separate Z.ai account is needed. Specify the model identifier and AI Gateway handles routing. Bring your own key (BYOK) is also supported.

  • What is the pricing for GLM-4.6V-Flash?

    See the pricing section on this page for today's rates. AI Gateway exposes each provider's pricing for GLM-4.6V-Flash.

  • Is GLM-4.6V-Flash suitable for agentic visual workflows?

    Yes, for agent steps that prioritize speed. For complex visual planning requiring deep reasoning or native multimodal function calling at peak accuracy, route those steps to the full GLM-4.6V model.
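
The per-step routing described in the last answer can be sketched as a small selector. The identifier `zai/glm-4.6v-flash` comes from the snippet on this page. The identifier `zai/glm-4.6v` for the full model, the `AgentStep` type, and the `pickModel` helper are assumptions for illustration.

```typescript
const FLASH_MODEL = 'zai/glm-4.6v-flash'; // identifier used on this page
const FULL_MODEL = 'zai/glm-4.6v'; // assumed identifier for the full model

// Hypothetical description of one step in an agent workflow.
type AgentStep = { name: string; needsDeepReasoning: boolean };

// Speed-oriented steps go to Flash; deep-reasoning steps go to the full model.
function pickModel(step: AgentStep): string {
  return step.needsDeepReasoning ? FULL_MODEL : FLASH_MODEL;
}

const steps: AgentStep[] = [
  { name: 'classify screenshot', needsDeepReasoning: false },
  { name: 'plan multi-step UI changes', needsDeepReasoning: true },
];
for (const step of steps) {
  console.log(`${step.name} -> ${pickModel(step)}`);
}
```

How a step gets flagged as needing deep reasoning is left to the application. A fixed flag per step type is the simplest starting point.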