GLM-4.6V-Flash
GLM-4.6V-Flash is Z.ai's lightweight 9B-parameter vision-language model for low-latency applications. It shares GLM-4.6V's multimodal capabilities at a fraction of the compute cost.
```typescript
import { streamText } from 'ai'

const result = streamText({ model: 'zai/glm-4.6v-flash', prompt: 'Why is the sky blue?' })
```
Playground
Try out GLM-4.6V-Flash by Z.ai. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
AI Gateway can route requests across multiple providers. To set a preference, copy a provider slug and pass it in your request options. Using a provider means you agree to its terms, listed under Legal.
Per-provider metrics, measured on live AI Gateway traffic: P50 throughput in tokens per second (TPS), P50 time to first token (TTFT) in milliseconds, and direct request success rate on AI Gateway and per provider.
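A provider preference can be sketched as a small helper that builds the request options. This is a minimal sketch, not the gateway's definitive API: the `gateway.order` shape reflects my understanding of the AI Gateway provider options, and both the `preferProviders` helper and the `'zai'` slug are hypothetical placeholders. Check the provider list for the real slug.

```typescript
// Assumed shape of the gateway provider options (verify against the AI Gateway docs).
interface GatewayProviderOptions {
  gateway: { order: string[] };
}

// Hypothetical helper: trims slugs, drops empties and duplicates,
// and returns options asking the gateway to try providers in this order.
function preferProviders(slugs: string[]): GatewayProviderOptions {
  const order = slugs
    .map((slug) => slug.trim())
    .filter((slug, index, all) => slug.length > 0 && all.indexOf(slug) === index);
  if (order.length === 0) {
    throw new Error('at least one provider slug is required');
  }
  return { gateway: { order } };
}

// 'zai' is a placeholder slug used only for illustration.
const providerOptions = preferProviders(['zai']);
console.log(JSON.stringify(providerOptions));
```

The resulting object would be passed as `providerOptions` alongside `model` and `prompt` in the `streamText` call.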
About GLM-4.6V-Flash
GLM-4.6V-Flash is the 9B-parameter efficiency variant in Z.ai's GLM-4.6V family, released September 30, 2025. Where GLM-4.6V targets maximum capability at 106B parameters, GLM-4.6V-Flash delivers vision-language understanding at a scale suited to latency-sensitive production workloads.
Despite its compact size, GLM-4.6V-Flash retains the core multimodal capabilities of the GLM-4.6V generation: a 128K-token context window, multimodal document understanding, and visual reasoning. The reduced parameter count translates to faster inference and lower per-token cost, which makes high-volume visual processing pipelines economically viable.
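To make the 128K-token context window concrete, here is a rough pre-flight check for text inputs. It is a sketch under loud assumptions: "128K" is read as 128,000 tokens, the four-characters-per-token ratio is a crude heuristic rather than the model's tokenizer, and image inputs are not counted at all. The function names are hypothetical.

```typescript
// Assumption: "128K" is taken to mean 128,000 tokens.
const CONTEXT_WINDOW_TOKENS = 128000;

// Crude heuristic (assumption): roughly 4 characters per token for English text.
function estimateTextTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns true when the estimated input plus the tokens reserved for the
// model's output stays within the assumed context window.
function fitsContext(documents: string[], reservedForOutput: number): boolean {
  const inputTokens = documents.reduce(
    (sum, doc) => sum + estimateTextTokens(doc),
    0
  );
  return inputTokens + reservedForOutput <= CONTEXT_WINDOW_TOKENS;
}

console.log(fitsContext(['A short document.'], 1000));
```

A check like this only flags obviously oversized text batches before a request is sent. For exact limits, rely on the token counts the API reports.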
Route traffic through AI Gateway for managed access, unified billing, and built-in observability across providers.
What To Consider When Choosing a Provider
- Model size: GLM-4.6V-Flash is optimized for speed and efficiency. For tasks requiring the deepest visual reasoning or the most accurate frontend replication, GLM-4.6V (106B) may produce better results.
- Context window: GLM-4.6V-Flash has the same 128K-token context window as the full GLM-4.6V, so long-document and multi-image workflows don't require compromises on input size.
- Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
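The authentication point above can be sketched as a credential lookup. This assumes the environment variable names `AI_GATEWAY_API_KEY` and `VERCEL_OIDC_TOKEN`, and a `Bearer` authorization header. The helper names and the choice to prefer the API key over the OIDC token are illustrative assumptions, not documented gateway behavior.

```typescript
type Env = { [name: string]: string | undefined };

interface GatewayCredential {
  kind: 'api-key' | 'oidc';
  token: string;
}

// Hypothetical helper: prefer an explicit API key, fall back to an OIDC token.
// The precedence here is an assumption made for this sketch.
function resolveGatewayCredential(env: Env): GatewayCredential {
  const apiKey = env['AI_GATEWAY_API_KEY'];
  if (apiKey) {
    return { kind: 'api-key', token: apiKey };
  }
  const oidcToken = env['VERCEL_OIDC_TOKEN'];
  if (oidcToken) {
    return { kind: 'oidc', token: oidcToken };
  }
  throw new Error('No AI Gateway credential found in the environment');
}

// Build the Authorization header for a direct HTTP request to the gateway.
function authHeaders(env: Env): { Authorization: string } {
  return { Authorization: `Bearer ${resolveGatewayCredential(env).token}` };
}

// Placeholder value for illustration only; never hard-code real credentials.
console.log(resolveGatewayCredential({ AI_GATEWAY_API_KEY: 'example-key' }).kind);
```

In a Node.js process, `process.env` would be passed as the `env` argument. Either way, no provider-specific credentials are involved.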
When to Use GLM-4.6V-Flash
Best For
- High-volume visual processing: Pipelines whose feasibility depends on per-request cost and latency
- Low-latency multimodal applications: Real-time user interactions that require visual understanding
- Document scanning and extraction pipelines: Large volumes of images and documents that must be processed quickly
- Visual classification and captioning: Workloads where throughput at scale matters more than peak reasoning depth
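For workloads like those listed above, a request typically pairs an instruction with one or more images. The sketch below builds such a message using locally declared types that mirror the AI SDK's text and image content parts as I understand them. The `buildExtractionMessage` helper and the example URLs are hypothetical.

```typescript
// Local types mirroring the AI SDK's user message content parts (assumed shape).
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Hypothetical helper: one instruction followed by one image part per URL.
function buildExtractionMessage(
  instruction: string,
  imageUrls: string[]
): UserMessage {
  if (imageUrls.length === 0) {
    throw new Error('at least one image URL is required');
  }
  const textPart: TextPart = { type: 'text', text: instruction };
  const imageParts: ImagePart[] = imageUrls.map(
    (url): ImagePart => ({ type: 'image', image: url })
  );
  return { role: 'user', content: [textPart, ...imageParts] };
}

// Example URLs are placeholders.
const message = buildExtractionMessage(
  'Extract the invoice number and total from each page.',
  ['https://example.com/page-1.png', 'https://example.com/page-2.png']
);
console.log(message.content.length);
```

The message would then be sent with `streamText({ model: 'zai/glm-4.6v-flash', messages: [message] })` in place of the `prompt` field.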
Consider Alternatives When
- Maximum visual reasoning: GLM-4.6V (106B) provides the full-scale model for the most demanding visual tasks
- Pixel-accurate frontend replication: The full GLM-4.6V model is better suited for HTML/CSS reconstruction from screenshots
- Text-only capabilities: GLM-4.6 provides the coding-focused model without vision processing overhead
- Next-generation features: Evaluate GLM-4.7 and GLM-5 for capabilities beyond this generation
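The guidance in the two lists above can be expressed as a simple routing rule. This is an illustrative sketch only: the `TaskProfile` fields and the `chooseModel` helper are hypothetical, and the slugs `zai/glm-4.6v` and `zai/glm-4.6` are assumed by analogy with `zai/glm-4.6v-flash`, so confirm them in the model list before use.

```typescript
// Hypothetical description of a workload, used only for this sketch.
interface TaskProfile {
  needsVision: boolean;
  needsDeepVisualReasoning: boolean;
  needsPixelAccurateFrontend: boolean;
}

// Slugs other than 'zai/glm-4.6v-flash' are assumptions based on its naming pattern.
function chooseModel(task: TaskProfile): string {
  if (!task.needsVision) {
    return 'zai/glm-4.6'; // text-only work: skip vision processing overhead
  }
  if (task.needsDeepVisualReasoning || task.needsPixelAccurateFrontend) {
    return 'zai/glm-4.6v'; // full-scale 106B model for the most demanding visual tasks
  }
  return 'zai/glm-4.6v-flash'; // default for high-volume, latency-sensitive vision work
}

console.log(
  chooseModel({
    needsVision: true,
    needsDeepVisualReasoning: false,
    needsPixelAccurateFrontend: false,
  })
);
```

Keeping the rule in one function makes it easy to adjust as newer generations become worth evaluating.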
Conclusion
GLM-4.6V-Flash makes vision-language capability accessible at a scale that fits latency-sensitive and cost-conscious deployments. For teams processing visual content at volume, it provides the core multimodal capabilities of the GLM-4.6V generation in a practical 9B-parameter package.