
Kimi K2 Instruct

Kimi K2 Instruct is Moonshot AI's Mixture-of-Experts (MoE) language model, with one trillion total parameters, 32 billion active per forward pass, and a 131K-token context window. It is available through AI Gateway via novita.

Tool Use
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'moonshotai/kimi-k2',
  prompt: 'Why is the sky blue?',
})

// Print tokens as they stream in.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}

Playground

Try out Kimi K2 Instruct by Moonshot AI. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Kimi K2 Instruct

Kimi K2 Instruct, released September 5, 2025, is a Mixture-of-Experts (MoE) language model from Moonshot AI.

Sparse expert routing at 32B activation. The full trillion parameters encode broad knowledge: programming languages, API conventions, domain facts, and tool-use patterns. At inference time, a routing mechanism selects roughly 32 billion parameters per token. Latency and compute cost stay comparable to a dense 32B model, while the knowledge base spans the entire trillion-parameter budget.

With 32B active parameters for reasoning depth and a full 1T parameter budget encoding broad tool-use and coding knowledge, K2 handles structured sequences of API calls, multi-step planning, and code synthesis.
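A back-of-the-envelope sketch of the activation arithmetic above, using the parameter counts from this page (the helper name is ours, and the calculation is illustrative, not Moonshot's actual router):

```typescript
// Toy calculation, not the real K2 router: how much of the parameter
// budget is touched per token under sparse expert routing.
const TOTAL_PARAMS = 1_000_000_000_000 // 1T total
const ACTIVE_PARAMS = 32_000_000_000   // ~32B active per token

function activeFraction(total: number, active: number): number {
  return active / total
}

// Roughly 3.2% of parameters participate in each forward pass,
// which is why compute cost tracks a dense 32B model.
console.log(`${(activeFraction(TOTAL_PARAMS, ACTIVE_PARAMS) * 100).toFixed(1)}%`) // prints "3.2%"
```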

Kimi K2 Instruct is available through AI Gateway at $0.57 per million input tokens and $2.30 per million output tokens.
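At those rates, a per-request cost estimate can be sketched like this (the helper name and example token counts are ours):

```typescript
// Estimate request cost at the AI Gateway rates listed for Kimi K2 Instruct.
const INPUT_USD_PER_M = 0.57  // $ per million input tokens
const OUTPUT_USD_PER_M = 2.30 // $ per million output tokens

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_USD_PER_M +
    (outputTokens / 1_000_000) * OUTPUT_USD_PER_M
  )
}

// A 10k-token prompt with a 2k-token completion costs about a penny:
console.log(estimateCostUSD(10_000, 2_000).toFixed(4)) // ≈ 0.0103
```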

AI Gateway currently routes K2 through novita; when a model is served by multiple providers, AI Gateway fails over between them automatically.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Context | Latency | Throughput | Input | Output | Cache | Web Search Per Query | Capabilities | ZDR | No Training | Release Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Novita AI | 131K | 0.8s | 36tps | $0.57/M | $2.30/M | — | — | — | — | — | 09/05/2025 |
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Moonshot AI

| Model | Context | Latency | Throughput | Input | Output | Cache Read | Providers | Release Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| — | 262K | 1.1s | 111tps | $0.95/M | $4.00/M | $0.16/M | fireworks, moonshotai, novita | 04/20/2026 |
| — | 262K | 0.4s | 81tps | $0.50/M | $2.80/M | $0.10/M | bedrock, fireworks, moonshotai, +2 | 01/26/2026 |
| — | 262K | 0.6s | 32tps | $0.60/M | $2.50/M | $0.15/M | deepinfra, moonshotai | 11/06/2025 |
| — | 262K | 0.6s | 125tps | $1.15/M | $8.00/M | $0.15/M | moonshotai | 11/06/2025 |
| — | 256K | 0.8s | 73tps | $1.15/M | $8.00/M | $0.15/M | moonshotai | 09/05/2025 |

What To Consider When Choosing a Provider

  • Configuration: K2 routes through novita on AI Gateway. Choose gateway routing when uptime and automatic failover matter most.
  • Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
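For illustration, a raw HTTP call with API-key auth might look like the sketch below. The endpoint URL and payload shape are assumptions based on AI Gateway's OpenAI-compatible API; check the docs before relying on them.

```typescript
// Hypothetical direct call to AI Gateway; URL and payload shape are assumed.
const GATEWAY_URL = 'https://ai-gateway.vercel.sh/v1/chat/completions'

function buildRequest(apiKey: string, prompt: string) {
  return {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`, // API key auth (an OIDC token also works)
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'moonshotai/kimi-k2',
      messages: [{ role: 'user', content: prompt }],
    }),
  }
}

// Usage: await fetch(GATEWAY_URL, buildRequest(process.env.AI_GATEWAY_API_KEY!, 'Hello'))
```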

When to Use Kimi K2 Instruct

Best For

  • Agentic pipelines: Structured sequences of API calls, data processing, and code synthesis
  • Provider redundancy: Deployments where failover across multiple providers matters most
  • K2 architecture baseline: Teams evaluating the K2 architecture for the first time who want the original release
  • Broad knowledge at low cost: Workloads that benefit from trillion-parameter knowledge breadth at 32B-dense inference economics
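The "agentic pipelines" bullet above describes a call-parse-branch-synthesize loop; a minimal sketch with stubbed APIs (all names here are illustrative, not a real integration) might look like:

```typescript
// Stubbed multi-step pipeline: invoke an API, branch on the parsed
// result, call a second API, then synthesize a final answer.
type ToolResult = { ok: boolean; data: string }

async function fetchRecords(query: string): Promise<ToolResult> {
  // Stand-in for a real API the model would invoke via a tool call.
  return { ok: query.length > 0, data: `records for ${query}` }
}

async function enrich(input: string): Promise<ToolResult> {
  // Stand-in for a second, dependent API call.
  return { ok: true, data: `enriched ${input}` }
}

async function runPipeline(query: string): Promise<string> {
  const first = await fetchRecords(query)
  if (!first.ok) return 'no results' // branch on the first result
  const second = await enrich(first.data)
  return `summary: ${second.data}` // synthesize the final output
}
```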

Consider Alternatives When

  • Chain-of-thought traces: Kimi K2 Thinking layers extended reasoning on top of this foundation
  • Minimum latency: Kimi K2 Turbo is the speed-optimized variant
  • September 2025 checkpoint: Use Kimi K2-0905 for expanded context and refined agentic training
  • Multimodal inputs: K2 processes text only, so reach for a vision-capable model

Conclusion

Kimi K2 Instruct established that sparse expert routing can deliver dense-model responsiveness at trillion-parameter scale. Its architecture anchors the entire K2 family of specialized variants. Routing through novita on AI Gateway makes it straightforward to deploy in high-availability production.

Frequently Asked Questions

  • How does sparse routing translate to cost savings?

    The full 1T parameters store broad knowledge, but only ~32B activate per token via the expert router. You pay compute proportional to a 32B dense model while drawing on knowledge encoded across the entire trillion-parameter budget.

  • Why does base K2's provider routing differ from other K2 variants?

    Base K2 was the first K2 variant adopted across providers, so its current routing through novita reflects that earlier integration. Later checkpoints and variants can have narrower provider sets.

  • Is K2 text-only?

    Yes. Kimi K2 Instruct accepts and produces text. Multimodal capabilities are not part of this release.

  • What agentic patterns does K2 handle well?

    Structured multi-step sequences: invoke an API, parse the response, branch on results, call a second API, and synthesize a final output. The function-calling interface in AI Gateway maps directly to these workflows.

  • Can I bring my own provider credentials?

    Yes. AI Gateway supports Bring Your Own Key for providers where you hold a direct account. BYOK requests are excluded from ZDR coverage.