GLM-4.6V-Flash

GLM-4.6V-Flash is Z.ai's lightweight 9B parameter vision-language model for low-latency applications. It shares GLM-4.6V's multimodal capabilities at a fraction of the compute cost.

Capabilities: Vision (Image), Reasoning, File Input, Tool Use, Implicit Caching

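index.ts

import { streamText } from 'ai'

// AI Gateway routes the request based on the model identifier.
const result = streamText({
  model: 'zai/glm-4.6v-flash',
  prompt: 'Why is the sky blue?'
})

// Consume the stream; without this loop nothing is printed.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}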
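Because GLM-4.6V-Flash accepts image input, a request would usually carry an image alongside the text. The sketch below builds a user message in the multi-part shape the AI SDK accepts for `messages` (a `text` part plus an `image` part). The helper name `buildImageQuestion` and the example URL are hypothetical, and the types are declared locally so the snippet stands alone.

```typescript
// Minimal local types mirroring the AI SDK's multi-part user message shape.
type TextPart = { type: 'text'; text: string }
type ImagePart = { type: 'image'; image: URL }
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> }

// Hypothetical helper: pair a question with a single image URL.
function buildImageQuestion(question: string, imageUrl: string): UserMessage {
  return {
    role: 'user',
    content: [
      { type: 'text', text: question },
      { type: 'image', image: new URL(imageUrl) },
    ],
  }
}

const message = buildImageQuestion(
  'What trend does this chart show?',
  'https://example.com/chart.png', // placeholder URL, not a real asset
)
```

The result could be passed as `messages: [message]` in place of `prompt` in a `streamText` call.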

Playground

Try out GLM-4.6V-Flash by Z.ai. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
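Setting a provider preference in code might look like the following sketch. It assumes the AI Gateway `order` option passed under `providerOptions.gateway`; the helper name `preferProviders` and the slug `zai` are illustrative, so copy the real slug from the Providers table.

```typescript
// Shape of the gateway routing options assumed in this sketch.
type GatewayOptions = { gateway: { order: string[] } }

// Hypothetical helper: turn copied provider slugs into a preference order,
// dropping blanks and duplicates while keeping first-seen order.
function preferProviders(...slugs: string[]): GatewayOptions {
  const cleaned = slugs.map((s) => s.trim()).filter((s) => s.length > 0)
  return { gateway: { order: Array.from(new Set(cleaned)) } }
}

// 'zai' is an assumed slug used only for illustration.
const providerOptions = preferProviders('zai', ' zai ', '')
```

The returned object is meant to be passed as `providerOptions` next to `model` and `prompt`.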

Context	Release Date
128K	09/30/2025

Legal: Terms, Privacy

More models by Z.ai

Context	Latency	Throughput	Input	Output	Cache Read	Cache Write	Release Date
205K	1.1s	55tps	$1.40/M	$4.40/M	$0.26/M	—	04/07/2026
203K	0.9s	86tps	$1.20/M	$4.00/M	$0.24/M	—	03/15/2026
203K	0.4s	65tps	$0.80/M	$2.56/M	$0.16/M	—	02/12/2026
205K	0.1s	666tps	$2.25/M	$2.75/M	$2.25/M	—	12/22/2025
205K	0.4s	208tps	$0.60/M	$2.20/M	$0.11/M	—	09/30/2025
200K	0.1s	149tps	$0.07/M	$0.40/M	$0.01/M	—
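Prices above are quoted per million tokens, so a request's cost follows from simple arithmetic. The sketch below is only an illustration: it uses the $0.60/M input and $2.20/M output figures listed above, which are not necessarily the rates for GLM-4.6V-Flash, and it ignores cached-read discounts. The function and type names are hypothetical.

```typescript
// Prices are quoted in USD per million tokens ("$/M").
type Rates = { inputPerM: number; outputPerM: number }

// Estimate the cost of one request from its token counts.
function estimateCostUsd(inputTokens: number, outputTokens: number, rates: Rates): number {
  const inputCost = (inputTokens / 1_000_000) * rates.inputPerM
  const outputCost = (outputTokens / 1_000_000) * rates.outputPerM
  return inputCost + outputCost
}

// Illustrative rates only: $0.60/M input, $2.20/M output.
const cost = estimateCostUsd(10_000, 2_000, { inputPerM: 0.6, outputPerM: 2.2 })
console.log(cost.toFixed(4)) // prints "0.0104"
```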

About GLM-4.6V-Flash

GLM-4.6V-Flash is the 9B parameter efficiency variant in Z.ai's GLM-4.6V family, released September 30, 2025. Where GLM-4.6V targets maximum capability at 106B parameters, GLM-4.6V-Flash delivers vision-language understanding at a scale suitable for latency-sensitive production workloads.

Despite its compact size, GLM-4.6V-Flash retains the core multimodal capabilities of the GLM-4.6V generation: a context window of 128K tokens, multimodal document understanding, and visual reasoning. The reduced parameter count translates to faster inference and lower per-token cost, which makes high-volume visual processing pipelines economically viable.
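When batching long documents or many images, it can help to check the inputs against the context window before sending them. The sketch below assumes "128K" means 128,000 tokens and that the caller already has a token count for each input (how images are tokenized is not covered here). The constant and function names are hypothetical.

```typescript
// Assumption: "128K" is treated as 128,000 tokens.
const CONTEXT_WINDOW_TOKENS = 128_000

// Returns the tokens left after the inputs and a reserved reply budget.
// A negative result means the inputs do not fit in one request.
function remainingBudget(inputTokenCounts: number[], reservedForOutput: number): number {
  const used = inputTokenCounts.reduce((sum, n) => sum + n, 0)
  return CONTEXT_WINDOW_TOKENS - reservedForOutput - used
}

const fits = remainingBudget([90_000, 30_000], 4_000) // 4000 tokens to spare
const overflow = remainingBudget([100_000, 30_000], 4_000) // -6000, too large
```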

Route traffic through AI Gateway for managed access, unified billing, and built-in observability across providers.

What To Consider When Choosing a Provider

Configuration: GLM-4.6V-Flash is optimized for speed and efficiency. For tasks requiring the deepest visual reasoning or the most accurate frontend replication, GLM-4.6V (106B) may produce better results.
Configuration: GLM-4.6V-Flash has the same 128K-token context window as the full GLM-4.6V, so long-document and multi-image workflows don't require compromises on input size.
Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
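The authentication point above can be sketched as a small credential lookup. The environment variable names `AI_GATEWAY_API_KEY` and `VERCEL_OIDC_TOKEN`, the preference for the API key, and the helper name are assumptions made for illustration; check the AI Gateway documentation for the names your deployment uses.

```typescript
type Env = Record<string, string | undefined>
type Credential = { kind: 'api-key' | 'oidc'; token: string }

// Hypothetical helper: pick an API key if present, else an OIDC token.
// The variable names below are assumptions, not confirmed by this page.
function resolveCredential(env: Env): Credential {
  const apiKey = env['AI_GATEWAY_API_KEY']
  if (apiKey) return { kind: 'api-key', token: apiKey }

  const oidcToken = env['VERCEL_OIDC_TOKEN']
  if (oidcToken) return { kind: 'oidc', token: oidcToken }

  throw new Error('No AI Gateway credential found in the environment')
}
```

Taking the environment as a parameter keeps the function easy to test; in a Node.js process it would typically be called with `process.env`.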

When to Use GLM-4.6V-Flash

Best For

High-volume visual processing: Per-request cost and latency determine pipeline feasibility
Low-latency multimodal applications: Real-time user interactions that require visual understanding
Document scanning and extraction pipelines: High volumes of images and documents processed at speed
Visual classification and captioning: Throughput matters more than peak reasoning depth at scale

Consider Alternatives When

Maximum visual reasoning: GLM-4.6V (106B) provides the full-scale model for the most demanding visual tasks
Pixel-accurate frontend replication: The full GLM-4.6V model is better suited for HTML/CSS reconstruction from screenshots
Text-only capabilities: GLM-4.6 provides the coding-focused model without vision processing overhead
Next-generation features: Evaluate GLM-4.7 and GLM-5 for capabilities beyond this generation
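The guidance in the two lists above can be expressed as a simple routing rule. This is a sketch under stated assumptions: only `zai/glm-4.6v-flash` appears on this page, so the identifiers `zai/glm-4.6v` and `zai/glm-4.6` are guessed by analogy, and the `Task` fields are hypothetical.

```typescript
// Hypothetical description of what a pipeline step needs.
type Task = { needsVision: boolean; needsPeakReasoning: boolean }

// Route text-only work to the text model, demanding visual work to the
// full model, and everything else to the faster Flash variant.
function pickModel(task: Task): string {
  if (!task.needsVision) return 'zai/glm-4.6' // assumed identifier
  return task.needsPeakReasoning
    ? 'zai/glm-4.6v' // assumed identifier
    : 'zai/glm-4.6v-flash'
}

const captioning = pickModel({ needsVision: true, needsPeakReasoning: false })
const uiReplication = pickModel({ needsVision: true, needsPeakReasoning: true })
```

The same rule could be applied per agent step, sending fast steps to Flash and only the hardest visual planning steps to the full model.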

Conclusion

GLM-4.6V-Flash makes vision-language capability accessible at a scale that fits latency-sensitive and cost-conscious deployments. For teams processing visual content at volume, it provides the core multimodal capabilities of the GLM-4.6V generation in a practical 9B parameter package.

Frequently Asked Questions

How does GLM-4.6V-Flash compare to the full GLM-4.6V?
GLM-4.6V-Flash is a 9B parameter model optimized for speed and lower per-token cost. GLM-4.6V is the full 106B parameter model for maximum visual reasoning capability. Both share a context window of 128K tokens.

Does GLM-4.6V-Flash support the same inputs as GLM-4.6V?
Yes. It processes images, documents, charts, and text within a context window of 128K tokens, though peak performance on the most complex visual reasoning tasks will be lower than the full 106B model.

What is the context window for GLM-4.6V-Flash?
128K tokens, matching the full GLM-4.6V model.

How do I authenticate with GLM-4.6V-Flash through AI Gateway?
AI Gateway provides a unified API key, so no separate Z.ai account is needed. Specify the model identifier and AI Gateway handles routing. Bring-your-own-key (BYOK) is also supported.

What is the pricing for GLM-4.6V-Flash?
See the pricing section on this page for current rates. AI Gateway exposes each provider's pricing for GLM-4.6V-Flash.

Is GLM-4.6V-Flash suitable for agentic visual workflows?
Yes, for agent steps that prioritize speed. For complex visual planning that requires deep reasoning or native multimodal function calling at peak accuracy, route those steps to the full GLM-4.6V model.
