GLM 4.7 FlashX

GLM 4.7 FlashX is the ultra-fast inference variant in Z.ai's GLM-4.7 generation, released January 1, 2025. Designed for the lowest latency workloads, it provides the fastest response times in the GLM-4.7 family while retaining core coding and reasoning capabilities.

Reasoning, Tool Use, Implicit Caching
index.ts
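import { streamText } from 'ai'

const result = streamText({
  model: 'zai/glm-4.7-flashx',
  prompt: 'Why is the sky blue?',
})

// Print the response as it streams in.
for await (const text of result.textStream) process.stdout.write(text)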

Playground

Try out GLM 4.7 FlashX by Z.ai. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Legal | Context | Latency | Throughput | Input | Output | Cache Read | Release Date |
| Z.ai | Terms, Privacy | 200K | | 157tps | $0.06/M | $0.40/M | $0.01/M | 01/01/2025 |
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.
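
P50 means the median: half of the observed requests were faster and half were slower. As a minimal sketch of that statistic (not AI Gateway's actual aggregation, which this page does not describe), the hypothetical `p50` helper below computes it from a list of samples such as TTFT values in milliseconds.

```typescript
// Hypothetical helper: median (P50) of a list of numeric samples.
// This illustrates the statistic only; it is not AI Gateway's implementation.
function p50(samples: number[]): number {
  if (samples.length === 0) {
    throw new Error('p50 needs at least one sample')
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  // Odd count: the middle value. Even count: mean of the two middle values.
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2
}

// Made-up TTFT samples in milliseconds; the 400 ms outlier does not move the median.
const ttftMs = [120, 90, 400, 110, 100]
console.log(p50(ttftMs)) // 110
```

A median is less sensitive to occasional slow requests than a mean, which is why a single outlier in the sample above leaves the result unchanged.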

More models by Z.ai

| Model | Context | Latency | Throughput | Input | Output | Cache Read | Providers | Release Date |
| | 205K | 0.8s | 49tps | $1.40/M | $4.40/M | $0.26/M | deepinfra, fireworks, novita, +1 | 04/07/2026 |
| | 203K | 1.0s | 94tps | $1.20/M | $4.00/M | $0.24/M | zai | 03/15/2026 |
| | 203K | 0.4s | 52tps | $0.80/M | $2.56/M | $0.16/M | bedrock, deepinfra, fireworks, +3 | 02/12/2026 |
| | 205K | 0.1s | 453tps | $2.25/M | $2.75/M | $2.25/M | bedrock, cerebras, deepinfra, +2 | 12/22/2025 |
| | 205K | 0.3s | 73tps | $0.60/M | $2.20/M | $0.11/M | baseten, deepinfra, novita, +1 | 09/30/2025 |
| | 200K | 0.1s | | $0.07/M | $0.40/M | $0.01/M | bedrock, zai | |

About GLM 4.7 FlashX

GLM 4.7 FlashX was released January 1, 2025 as the fastest inference tier in Z.ai's GLM-4.7 generation. It targets workloads where response latency is the dominant constraint: real-time user-facing applications, high-frequency API calls, and pipeline steps that block downstream processing.

GLM 4.7 FlashX is the most aggressively speed-optimized variant in the 4.7 family, so it gives up the most capability relative to the full GLM-4.7. It retains the generation's core improvements in coding, reasoning, and conversational tone, but peak performance on the most complex tasks is lower. The tradeoff is intentional: most production requests do not need maximum reasoning depth, and for those GLM 4.7 FlashX delivers adequate quality at the lowest latency in the family.

The model shares the same API surface as GLM-4.7 and GLM-4.7-Flash, enabling seamless tier switching. Teams can route simple requests to GLM 4.7 FlashX and complex ones to GLM-4.7, optimizing both cost and quality across their request distribution.
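
The tier-switching idea above can be sketched as a small routing function. This is an illustration under stated assumptions, not a built-in AI Gateway feature: the task categories and the `pickModel` helper are hypothetical, and while `zai/glm-4.7-flashx` is the slug shown on this page, `zai/glm-4.7` is an assumed slug for the full model.

```typescript
// Hypothetical task categories; adapt them to your own request types.
type TaskKind =
  | 'extraction'
  | 'classification'
  | 'short-generation'
  | 'multi-step-reasoning'
  | 'complex-codegen'

// 'zai/glm-4.7-flashx' is the slug shown on this page.
const FAST_MODEL = 'zai/glm-4.7-flashx'
// Assumption: the full model follows the same naming pattern.
const DEEP_MODEL = 'zai/glm-4.7'

// Tasks this page describes as a good fit for the fastest tier.
const SIMPLE_TASKS = new Set<TaskKind>([
  'extraction',
  'classification',
  'short-generation',
])

// Return the model slug to use for a given kind of task.
function pickModel(kind: TaskKind): string {
  return SIMPLE_TASKS.has(kind) ? FAST_MODEL : DEEP_MODEL
}

console.log(pickModel('classification')) // zai/glm-4.7-flashx
console.log(pickModel('multi-step-reasoning')) // zai/glm-4.7
```

Because the variants share one API surface, the returned slug can be passed as the `model` value of the same `streamText` call, so only the string changes between tiers.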

What To Consider When Choosing a Provider

  • Latency vs. quality: GLM 4.7 FlashX is the right choice when response time is the binding constraint. If quality on complex tasks matters more, step up to GLM-4.7-Flash or GLM-4.7.
  • Routing: Use AI Gateway to route requests by complexity. Simple extraction, classification, and short generation tasks perform well on GLM 4.7 FlashX; route complex reasoning to higher-tier models.
  • Cost: GLM 4.7 FlashX has the lowest per-token cost in the 4.7 generation, which makes it the most economical option for workloads measured in millions of daily requests.
  • Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use GLM 4.7 FlashX

Best For

  • Real-time user-facing applications: Sub-second response times are required for acceptable user experience
  • High-frequency API endpoints: Thousands of requests per minute where latency compounds into throughput bottlenecks
  • Simple extraction and classification: Tasks that need language understanding without deep reasoning
  • Pipeline preprocessing steps: Steps that block downstream processing benefit from the fastest possible completion
  • Cost-optimized batch processing: Extreme volume where per-token cost is the primary economic driver
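
For the high-volume cases above, a rough cost estimate is simple arithmetic. The sketch below uses the Z.ai prices listed on this page ($0.06 per million input tokens, $0.40 per million output tokens); the `Workload` shape, the `dailyCostUsd` helper, and the example traffic numbers are hypothetical, and the estimate ignores cache-read discounts.

```typescript
// Prices from the provider table on this page, in USD per million tokens.
const INPUT_USD_PER_M = 0.06
const OUTPUT_USD_PER_M = 0.4

// Hypothetical description of a workload's average request.
interface Workload {
  requestsPerDay: number
  inputTokens: number // average input tokens per request
  outputTokens: number // average output tokens per request
}

// Estimated daily spend in USD, ignoring cache-read pricing.
function dailyCostUsd(w: Workload): number {
  const perRequestUsd =
    (w.inputTokens * INPUT_USD_PER_M + w.outputTokens * OUTPUT_USD_PER_M) /
    1_000_000
  return perRequestUsd * w.requestsPerDay
}

// Made-up example: one million short classification requests per day.
const estimate = dailyCostUsd({
  requestsPerDay: 1_000_000,
  inputTokens: 500,
  outputTokens: 100,
})
console.log(estimate.toFixed(2)) // about 70 USD per day
```

In this example, input tokens contribute about $30 and output tokens about $40 per day, so output length matters more than prompt length at these rates.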

Consider Alternatives When

  • Complex reasoning quality: GLM-4.7 or GLM-4.7-Flash provides deeper capability for multi-step planning
  • Balanced speed and capability: GLM-4.7-Flash offers a middle ground in the 4.7 generation
  • Speed-optimized vision: Evaluate GLM-4.6V-Flash for multimodal processing when vision is needed
  • Advanced thinking modes: GLM-5 provides multiple thinking modes and an expanded reasoning architecture

Conclusion

GLM 4.7 FlashX occupies the speed extreme of Z.ai's GLM-4.7 generation. For teams that measure success in milliseconds and process requests at massive scale, it provides the lowest-latency entry point to the 4.7 generation's improvements in coding, reasoning, and conversational quality.

Frequently Asked Questions

  • How fast is GLM 4.7 FlashX compared to other GLM-4.7 variants?

    GLM 4.7 FlashX is the fastest inference tier in the GLM-4.7 generation. It provides the lowest latency, followed by GLM-4.7-Flash, then the full GLM-4.7.

  • What capability tradeoffs does GLM 4.7 FlashX make?

    It trades peak reasoning and coding depth for speed. Core capabilities are retained, but the most complex multi-step reasoning and code generation tasks will produce better results on GLM-4.7 or GLM-4.7-Flash.

  • Can I mix GLM 4.7 FlashX with other GLM-4.7 models?

    Yes. All GLM-4.7 variants share the same API surface. Route simple requests to GLM 4.7 FlashX for speed and complex ones to GLM-4.7 for quality.

  • What is the context window for GLM 4.7 FlashX?

    200K tokens.

  • How do I authenticate with GLM 4.7 FlashX through AI Gateway?

    AI Gateway authenticates requests with a single API key or OIDC token, so no separate Z.ai account is needed. Pass the model identifier zai/glm-4.7-flashx to route requests to this model. Bring your own key (BYOK) is also supported.

  • What workloads is GLM 4.7 FlashX best for?

    Real-time user-facing applications, high-frequency API calls, simple classification and extraction, and any workload where response latency is the primary constraint.

  • How does pricing compare to other GLM-4.7 variants?

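    At the time of this listing, Z.ai charges $0.06 per million input tokens and $0.40 per million output tokens for GLM 4.7 FlashX, with cache reads at $0.01 per million. Check the pricing panel on this page for current numbers; AI Gateway tracks rates across every provider that serves the model.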