Kimi K2 Turbo
Kimi K2 Turbo is Moonshot AI's throughput-oriented K2 variant. It runs the K2 Mixture-of-Experts (MoE) architecture without thinking overhead, built for streaming interfaces, high-volume pipelines, and agentic workflows where first-token latency drives responsiveness.
import { streamText } from 'ai'

const result = streamText({
  model: 'moonshotai/kimi-k2-turbo',
  prompt: 'Why is the sky blue?',
})

for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}
What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
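Since either credential satisfies gateway auth, a service can fail fast at startup if neither is configured. A minimal sketch — the environment variable names AI_GATEWAY_API_KEY and VERCEL_OIDC_TOKEN are assumptions, so check the gateway documentation for the exact names:

```typescript
// Sketch: gateway auth accepts either an API key or an OIDC token.
// Both environment variable names below are assumed, not confirmed.
export function hasGatewayCredentials(
  env: Record<string, string | undefined>,
): boolean {
  return Boolean(env.AI_GATEWAY_API_KEY ?? env.VERCEL_OIDC_TOKEN)
}
```

Calling `hasGatewayCredentials(process.env)` during startup lets a deployment reject a misconfigured environment before any gateway request is issued.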
For agentic pipelines that need the lowest first-token latency, verify provider response time benchmarks for your deployment region.
When to Use Kimi K2 Turbo
Best For
Real-time streaming chat:
First-token latency drives perceived responsiveness in chat interfaces. No thinking overhead means the first token arrives sooner, which users notice immediately
High-frequency tool-calling agents:
Agents that execute many sequential or parallel tool calls benefit from the per-call latency reduction. A 100-step agentic workflow is faster at turbo latency than thinking latency
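The per-call savings compound linearly across sequential steps. A back-of-envelope sketch (the millisecond figures are illustrative assumptions, not published benchmarks):

```typescript
// Total wall-clock time for a sequential agent loop is steps x per-call latency.
function workflowLatencyMs(steps: number, perCallMs: number): number {
  return steps * perCallMs
}

// Illustrative only: assume ~800 ms per turbo call vs ~3000 ms per thinking call.
const turboTotal = workflowLatencyMs(100, 800)     // 80,000 ms, about 1.3 minutes
const thinkingTotal = workflowLatencyMs(100, 3000) // 300,000 ms, 5 minutes
```

Under these assumed numbers, the same 100-step workflow finishes in under a quarter of the time at turbo latency.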
Sub-agents in multi-agent orchestration:
When K2 Turbo serves as a worker node in a larger agentic system, its response time affects overall orchestration throughput. Fast sub-agents keep the pipeline moving
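One way to picture the sub-agent role is a fan-out coordinator, sketched below with a stubbed worker standing in for a real moonshotai/kimi-k2-turbo gateway call (both function names are hypothetical):

```typescript
// Stub worker; a real implementation would request a response from
// 'moonshotai/kimi-k2-turbo' through the gateway here.
async function callSubAgent(task: string): Promise<string> {
  return `result for ${task}`
}

// Fan-out: dispatch all sub-tasks in parallel so pipeline latency is
// bounded by the slowest worker, not the sum of all workers.
export async function orchestrate(tasks: string[]): Promise<string[]> {
  return Promise.all(tasks.map(callSubAgent))
}
```

Because the coordinator waits on the slowest worker, shaving per-call latency on each sub-agent directly shortens every fan-out round.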
Cost-sensitive high-volume production:
Lower latency often correlates with lower compute cost at scale. Kimi K2 Turbo delivers K2-level capability at a throughput-oriented configuration
Consider Alternatives When
The task requires explicit reasoning steps:
When chain-of-thought deliberation improves output quality, Kimi K2 Thinking or K2 Thinking Turbo is more appropriate
Complex multi-step planning is central to the workflow:
Tasks where the model needs to plan before acting benefit from the thinking variants' deliberation budget
You're building a reasoning benchmark or evaluation:
Benchmarks that reward explicit deliberation will yield different scores than the thinking-enabled variants would, so evaluate the variant you intend to ship
Conclusion
Kimi K2 Turbo is the right K2 configuration when speed is the binding constraint and chain-of-thought is overhead rather than a benefit. For streaming interfaces, high-frequency agents, and latency-sensitive pipelines, it delivers K2-generation capability at high throughput.
FAQ
What makes Kimi K2 Turbo faster than the thinking variants?
It drops the extended thinking overhead of K2 Thinking and K2 Thinking Turbo, so the generation budget goes directly to token output. Check this page for live throughput and latency figures.
Does Kimi K2 Turbo support tool calling and agentic workflows?
Yes. Multi-step tool calling, parallel function execution, and long-horizon tool-use sequences all run in turbo mode without thinking overhead.
When was Kimi K2 Turbo released?
Moonshot AI launched Kimi K2 Turbo on September 5, 2025. Moonshot AI publishes release notes at https://www.moonshot.ai and model details at https://moonshotai.github.io/Kimi-K2/. Check this page for AI Gateway pricing, routing, and limits.
How does Kimi K2 Turbo differ from K2 Thinking Turbo?
Both operate at turbo latency, but Kimi K2 Turbo has no thinking mode: responses generate directly, without deliberation, while K2 Thinking Turbo keeps a compressed thinking budget and emits chain-of-thought reasoning at turbo speed. Use Turbo when reasoning isn't needed and Thinking Turbo when deliberation improves quality.
What is the context window?
256K tokens, consistent with the K2 family.
How do I access Kimi K2 Turbo through AI Gateway?
Use the identifier moonshotai/kimi-k2-turbo with any supported interface. AI Gateway manages provider routing automatically.