
Kimi K2 Turbo

Kimi K2 Turbo is Moonshot AI's throughput-oriented K2 variant. It runs the K2 Mixture-of-Experts (MoE) architecture without extended-thinking overhead, and is built for streaming interfaces, high-volume pipelines, and agentic workflows where first-token latency drives responsiveness.

Streaming
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'moonshotai/kimi-k2-turbo',
  prompt: 'Why is the sky blue?',
})

// Print tokens as they arrive.
for await (const part of result.textStream) {
  process.stdout.write(part)
}

Playground

Try out Kimi K2 Turbo by Moonshot AI. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Kimi K2 Turbo

Kimi K2 Turbo is the throughput-maximized configuration of Kimi K2, launched on September 5, 2025. It runs the K2 Mixture-of-Experts (MoE) architecture (1T total parameters, 32B active per forward pass) without the extended thinking layer. All generation capacity goes toward token throughput. When a reasoning model's internal monologue adds latency without adding value, Kimi K2 Turbo removes that overhead entirely.

The K2 MoE architecture keeps agentic capabilities in turbo mode: multi-step tool calling, long-horizon task management, and parallel function execution all operate at the turbo speed profile. The model handles sequences of tool invocations (query an API, process the result, call another API, synthesize a response) without triggering thinking mode. For agentic pipelines where many such sequences run in parallel or in tight loops, the per-step latency reduction compounds into wall-clock savings.

Streaming interfaces benefit from lower first-token latency. A chat interface that begins streaming immediately shows visible output sooner than one that must wait for a thinking model to finish deliberating before the first token appears. Kimi K2 Turbo targets cases where streaming latency defines the product experience.
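To make first-token latency concrete, a small helper (hypothetical, not part of any SDK) can time the gap between issuing a request and the first streamed chunk. It works over any `AsyncIterable` of text, such as the `textStream` returned by `streamText`:

```typescript
// Measure time-to-first-token (TTFT) over any async stream of text chunks.
export async function measureTtft(
  stream: AsyncIterable<string>,
): Promise<{ ttftMs: number; text: string }> {
  const start = performance.now()
  let ttftMs = -1
  let text = ''
  for await (const chunk of stream) {
    // Record the moment the first visible token arrives.
    if (ttftMs < 0) ttftMs = performance.now() - start
    text += chunk
  }
  return { ttftMs, text }
}
```

The same helper applies to any model behind the gateway, which makes it a simple way to compare turbo and thinking variants on your own traffic.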

Kimi K2 Turbo is available through AI Gateway at $1.15 per million input tokens and $8 per million output tokens. Release history from Moonshot AI may also appear on https://www.moonshot.ai.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Context | Latency | Throughput | Input | Output | Cache Read | Cache Write | Release Date |
|---|---|---|---|---|---|---|---|---|
| Moonshot AI (Terms, Privacy) | 256K | 2.3s | 69 tps | $1.15/M | $8.00/M | $0.15/M | — | 09/05/2025 |
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Moonshot AI

| Model | Context | Latency | Throughput | Input | Output | Cache Read | Providers | Release Date |
|---|---|---|---|---|---|---|---|---|
| — | 262K | 1.2s | 86 tps | $0.95/M | $4.00/M | $0.16/M | Fireworks, Moonshot AI, Novita | 04/20/2026 |
| — | 262K | 0.5s | 60 tps | $0.50/M | $2.80/M | $0.10/M | Bedrock, Fireworks, Moonshot AI, +2 | 01/26/2026 |
| — | 262K | 0.7s | 23 tps | $0.60/M | $2.50/M | $0.15/M | DeepInfra, Moonshot AI | 11/06/2025 |
| — | 262K | 0.7s | 112 tps | $1.15/M | $8.00/M | $0.15/M | Moonshot AI | 11/06/2025 |
| — | 131K | 0.9s | 39 tps | $0.57/M | $2.30/M | — | Novita | 09/05/2025 |

What To Consider When Choosing a Provider

  • Configuration: For agentic pipelines that need the lowest first-token latency, verify provider response time benchmarks for your deployment region.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use Kimi K2 Turbo

Best For

  • Real-time streaming chat: First-token latency drives perceived responsiveness in chat interfaces. No thinking overhead means the first token arrives sooner, which users notice immediately
  • High-frequency tool-calling agents: Agents that execute many sequential or parallel tool calls benefit from the per-call latency reduction. A 100-step agentic workflow is faster at turbo latency than thinking latency
  • Sub-agents in multi-agent orchestration: When K2 Turbo serves as a worker node in a larger agentic system, its response time affects overall orchestration throughput. Fast sub-agents keep the pipeline moving
  • Cost-sensitive high-volume production: Higher throughput shortens how long each request occupies serving capacity, which can reduce effective cost at scale. Kimi K2 Turbo delivers K2-level capability in a throughput-oriented configuration
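The fan-out pattern behind the agent and sub-agent bullets above can be sketched independently of any SDK. The `callModel` parameter is a hypothetical stand-in for a real Kimi K2 Turbo request; the point is that wall-clock time for a batch is bounded by the slowest call, so per-call latency savings compound:

```typescript
// Fan N independent sub-agent tasks out in parallel. Total wall-clock time
// approximates the slowest single call, not the sum of all calls.
export async function fanOut<T>(
  tasks: string[],
  callModel: (prompt: string) => Promise<T>, // stand-in for a K2 Turbo request
): Promise<T[]> {
  return Promise.all(tasks.map((task) => callModel(task)))
}
```

In a tight agentic loop, the same shape applies per step: each round of parallel tool or sub-agent calls completes at turbo latency before the orchestrator synthesizes the results.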

Consider Alternatives When

  • The task requires explicit reasoning steps: When chain-of-thought deliberation improves output quality, Kimi K2 Thinking or K2 Thinking Turbo is more appropriate
  • Complex multi-step planning is central to the workflow: Tasks where the model needs to plan before acting benefit from the thinking variants' deliberation budget
  • You're building a reasoning benchmark or evaluation: Reasoning benchmarks that reward explicit deliberation will show different scores from thinking-enabled variants

Conclusion

Kimi K2 Turbo is the right K2 configuration when speed is the binding constraint and chain-of-thought is overhead rather than a benefit. For streaming interfaces, high-frequency agents, and latency-sensitive pipelines, it delivers K2-generation capability at high throughput.

Frequently Asked Questions

  • What is the throughput advantage of Kimi K2 Turbo over other K2 variants?

    It drops the extended-thinking overhead present in K2 Thinking and K2 Thinking Turbo, so all generation capacity goes to output tokens. Check this page for live throughput and latency figures.

  • Does Kimi K2 Turbo support tool calling without thinking mode?

    Yes. Multi-step tool calling, parallel function execution, and long-horizon tool-use sequences all run in turbo mode without thinking overhead.

  • When was Kimi K2 Turbo launched?

    Moonshot AI launched Kimi K2 Turbo on September 5, 2025. Moonshot AI publishes release notes on https://www.moonshot.ai and technical details on the K2 family at https://moonshotai.github.io/Kimi-K2/. AI Gateway pricing, routing, and limits are listed on this page.

  • How does Kimi K2 Turbo differ from Kimi K2 Thinking Turbo?

    Both operate at turbo latency, but Kimi K2 Turbo has no thinking mode. Responses generate directly without deliberation. K2 Thinking Turbo keeps a compressed thinking budget and emits chain-of-thought reasoning at turbo speed. Use Turbo when reasoning isn't needed. Use Thinking Turbo when deliberation improves quality.

  • What is the context window for Kimi K2 Turbo?

    256K tokens, consistent with the K2 family.

  • How do I start using Kimi K2 Turbo?

    Use the identifier moonshotai/kimi-k2-turbo with any supported interface. AI Gateway manages provider routing automatically.