MiMo V2 Flash

MiMo V2 Flash is Xiaomi's Mixture-of-Experts (MoE) reasoning model with 309B total parameters and 15B active per forward pass. It uses hybrid attention and multi-token prediction for inference efficiency, and supports a 262.1K-token context window at $0.10 per million input tokens and $0.30 per million output tokens.

Reasoning, Tool Use
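index.ts
import { streamText } from 'ai'

// Route the prompt to MiMo V2 Flash through AI Gateway.
const result = streamText({
  model: 'xiaomi/mimo-v2-flash',
  prompt: 'Why is the sky blue?'
})

// Print the response as it streams in.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}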

Playground

Try out MiMo V2 Flash by Xiaomi. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Legal | Context | Latency | Throughput | Input | Output | Cache Read | Cache Write | Release Date |
|---|---|---|---|---|---|---|---|---|---|
| Novita AI | Terms, Privacy | 262K | 1.4s | 139tps | $0.10/M | $0.30/M | $0.02/M | | 12/17/2025 |
| Chutes | Terms, Privacy | 262K | | | $0.09/M | $0.29/M | $0.04/M | | 12/17/2025 |
| Xiaomi | Terms, Privacy | 262K | 1.4s | 111tps | $0.10/M | $0.30/M | $0.01/M | | 12/17/2025 |

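
As a rough illustration of how the listed figures might feed a provider preference, the sketch below ranks the three provider slugs by request cost, breaking ties by listed throughput. The prices and throughput values are copied from the provider listing above; the `ProviderQuote` type and the `rankByCost` helper are hypothetical names for this example, not part of AI Gateway.

```typescript
// Figures copied from the provider listing; throughput was not listed for Chutes.
interface ProviderQuote {
  slug: string
  inputPerM: number // USD per million input tokens
  outputPerM: number // USD per million output tokens
  throughputTps?: number // P50 tokens per second, when listed
}

const providers: ProviderQuote[] = [
  { slug: 'novita', inputPerM: 0.1, outputPerM: 0.3, throughputTps: 139 },
  { slug: 'chutes', inputPerM: 0.09, outputPerM: 0.29 },
  { slug: 'xiaomi', inputPerM: 0.1, outputPerM: 0.3, throughputTps: 111 },
]

// Cost in USD of one request on a given provider.
function requestCost(p: ProviderQuote, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1e6
}

// Slugs ordered cheapest first; equal costs fall back to higher throughput.
function rankByCost(inputTokens: number, outputTokens: number): string[] {
  return [...providers]
    .sort((a, b) => {
      const diff = requestCost(a, inputTokens, outputTokens) - requestCost(b, inputTokens, outputTokens)
      return diff !== 0 ? diff : (b.throughputTps ?? 0) - (a.throughputTps ?? 0)
    })
    .map((p) => p.slug)
}

console.log(rankByCost(100000, 2000))
```

The resulting order is only a starting point for a preference; live latency, throughput, and uptime can shift the choice.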
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Xiaomi

| Model | Context | Latency | Throughput | Input | Output | Cache Read | Cache Write | Providers | Release Date |
|---|---|---|---|---|---|---|---|---|---|
| | 1.1M | 2.3s | 61tps | $1.00/M | $3.00/M | $0.2/M | | Xiaomi | 04/22/2026 |
| | 1.1M | 1.4s | 98tps | $0.40/M | $2.00/M | $0.08/M | | Xiaomi | 04/22/2026 |
| | 1M | 1.8s | 74tps | $1.00/M | $3.00/M | $0.2/M | | Xiaomi | 03/18/2026 |

About MiMo V2 Flash

MiMo V2 Flash is a Mixture-of-Experts model from Xiaomi, released December 17, 2025 under the MIT license. Each forward pass activates a fraction of total parameters, which keeps per-token cost down while the full parameter count stores broader knowledge.
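
To make "activates a fraction of total parameters" concrete, here is a toy top-k routing layer. It is a generic MoE sketch, not Xiaomi's router: the expert count, the value of k, the router scores, and the expert functions are all hypothetical. For scale, the 15B active of 309B total quoted for MiMo V2 Flash is roughly 5% of the weights per forward pass.

```typescript
// Toy top-k Mixture-of-Experts layer. All sizes and experts are hypothetical.
type Expert = (x: number) => number

// Numerically stable softmax over router scores.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores)
  const exps = scores.map((s) => Math.exp(s - max))
  const sum = exps.reduce((a, b) => a + b, 0)
  return exps.map((e) => e / sum)
}

// Indices of the k largest router probabilities.
function topK(probs: number[], k: number): number[] {
  return probs
    .map((p, i) => ({ p, i }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k)
    .map((e) => e.i)
}

function moeForward(x: number, routerScores: number[], experts: Expert[], k: number) {
  const probs = softmax(routerScores)
  const chosen = topK(probs, k)
  // Renormalize the selected gates so they sum to 1.
  const gateSum = chosen.reduce((s, i) => s + probs[i], 0)
  // Only the chosen experts run; the rest contribute no compute.
  const output = chosen.reduce((s, i) => s + (probs[i] / gateSum) * experts[i](x), 0)
  return { output, chosen, activeFraction: k / experts.length }
}

const experts: Expert[] = [(x) => x + 1, (x) => x * 2, (x) => x * x, (x) => -x]
const result = moeForward(3, [0.1, 2.0, 1.5, -1.0], experts, 2)
console.log(result.chosen, result.activeFraction)
```

Here two of four experts run, so half of the expert weights are touched; a production MoE uses many more experts and a far smaller active share.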

The architecture uses hybrid sliding window attention: sliding-window and global attention layers run in a fixed ratio with a 128-token window, which cuts KV-cache storage versus standard attention and makes a context window of 262.1K tokens practical. A multi-token prediction (MTP) block enables speculative decoding so generation can run faster during inference.
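
The KV-cache saving can be estimated with simple arithmetic: a sliding-window layer caches at most the last 128 positions, while a global layer caches the whole context. The sketch below counts cached positions per sequence under assumed values. The 128-token window comes from the text; the layer count (48), the 5:1 sliding-to-global ratio, and reading 262.1K as 262,144 tokens are assumptions made for illustration only.

```typescript
interface AttentionLayout {
  layers: number // total attention layers (assumed)
  slidingPerGlobal: number // sliding-window layers per global layer (assumed)
  windowTokens: number // sliding window size, 128 per the text
}

// Counts cached token positions summed over layers, for hybrid vs. full attention.
function cachedPositions(contextTokens: number, layout: AttentionLayout): { hybrid: number; full: number } {
  const groupSize = layout.slidingPerGlobal + 1
  const globalLayers = Math.floor(layout.layers / groupSize)
  const slidingLayers = layout.layers - globalLayers
  // A sliding layer never holds more than its window.
  const perSliding = Math.min(contextTokens, layout.windowTokens)
  return {
    hybrid: slidingLayers * perSliding + globalLayers * contextTokens,
    full: layout.layers * contextTokens,
  }
}

const layout: AttentionLayout = { layers: 48, slidingPerGlobal: 5, windowTokens: 128 }
const { hybrid, full } = cachedPositions(262144, layout)
console.log(`hybrid/full cache ratio: ${(hybrid / full).toFixed(3)}`)
```

Under these assumptions the hybrid layout stores well under a fifth of the positions that full attention would; the real saving depends on the actual layer count and ratio.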

For benchmark figures and methodology, see https://novita.ai/ (listed in the model changelog as the MiMo v2 Flash announcement).

MiMo V2 Flash is text-in, text-out. Call it through AI Gateway, which serves it from the novita, chutes, and xiaomi providers; list pricing is $0.10 per million input tokens and $0.30 per million output tokens.
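
A quick budget check follows from the list prices. The sketch below estimates how many requests of a given shape fit in a budget, using the $5 free-credit figure mentioned in the Playground section as the example. The function names and the 200,000-in / 4,000-out request shape are hypothetical.

```typescript
// List prices from this page, in USD per million tokens.
const INPUT_USD_PER_M = 0.1
const OUTPUT_USD_PER_M = 0.3

// Cost of a single request at list price.
function requestCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_USD_PER_M + outputTokens * OUTPUT_USD_PER_M) / 1e6
}

// Whole requests of this shape that a budget covers.
function requestsWithinBudget(budgetUsd: number, inputTokens: number, outputTokens: number): number {
  return Math.floor(budgetUsd / requestCostUsd(inputTokens, outputTokens))
}

// A long-context request: 200,000 input tokens and 4,000 output tokens.
console.log(requestCostUsd(200000, 4000).toFixed(4))
console.log(requestsWithinBudget(5, 200000, 4000))
```

Actual spend can differ from this estimate because provider prices vary slightly and cached reads are billed at a lower rate.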

What To Consider When Choosing a Provider

  • Configuration: The MoE design with sliding-window attention keeps serving cost low relative to benchmark scores. List pricing is $0.10 per million input tokens and $0.30 per million output tokens, with small differences between providers, so track real spend in AI Gateway at your volume.
  • Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
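
The authentication point above can be sketched as a request builder that attaches a single gateway credential and no provider credentials. This assumes an OpenAI-compatible chat completions route with Bearer-token auth; the `buildGatewayRequest` helper and the `https://gateway.example/v1` base URL are hypothetical placeholders, so check the AI Gateway docs for the real endpoint.

```typescript
// Sketch only: the route shape and base URL are assumptions, not confirmed API.
interface GatewayRequest {
  url: string
  init: { method: 'POST'; headers: Record<string, string>; body: string }
}

function buildGatewayRequest(baseUrl: string, token: string, prompt: string): GatewayRequest {
  if (!token) throw new Error('Missing AI Gateway API key or OIDC token')
  return {
    // Trim trailing slashes so the path joins cleanly.
    url: `${baseUrl.replace(/\/+$/, '')}/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        // One gateway credential; provider keys are never sent by the client.
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model: 'xiaomi/mimo-v2-flash',
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  }
}

const req = buildGatewayRequest('https://gateway.example/v1', 'test-key', 'Why is the sky blue?')
console.log(req.url)
```

The returned object can be passed to `fetch(req.url, req.init)` once the base URL and token are real.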

When to Use MiMo V2 Flash

Best For

  • Software engineering, math, and long-context tasks: You are choosing based on the benchmark tables at https://novita.ai/ (SWE-Bench Verified, LiveCodeBench, AIME-style math, long-context tests)
  • Long-context analysis: You need up to 262.1K tokens of context, including needle-in-a-haystack-style retrieval at long lengths
  • Cost-aware deployment: Low active-parameter compute matters to you as much as published benchmark results

Consider Alternatives When

  • Multimodal input required: MiMo V2 Flash is text-in, text-out only
  • MoE hosting constraints: You can't host or route to a large MoE stack, even with a small active count
  • Enterprise support terms: You need support guarantees this SKU doesn't offer
  • Simple classification or extraction jobs: A smaller model handles them at lower cost and is fast enough

Conclusion

MiMo V2 Flash pairs MoE routing, sliding-window attention, and MTP-style decoding to offer a 262.1K-token context window at $0.10 per million input tokens and $0.30 per million output tokens. See https://novita.ai/ for benchmark tables. Use AI Gateway for routing, retries, and usage tracking.

Frequently Asked Questions

  • How does MiMo V2 Flash score well with a small active parameter count?

    MoE routes tokens to expert blocks and only activates part of the weights each step. That keeps compute low while the full weight count still holds broad knowledge.

  • What is hybrid sliding window attention?

    It mixes sliding-window and global attention on a fixed schedule with a 128-token window. MiMo V2 Flash uses much smaller KV caches than full attention, which helps on a context of 262.1K tokens.

  • How does the multi-token prediction module work?

    It adds a lightweight MTP block so the model can propose several future tokens at once and verify them in fewer full steps, which raises output tokens per second during inference.

  • How do I authenticate requests to MiMo V2 Flash through AI Gateway?

    Add your API key in AI Gateway project settings (or use an OIDC token) and pass xiaomi/mimo-v2-flash as the model in API calls. AI Gateway routes, retries, and fails over across novita, chutes, and xiaomi.

  • What does MiMo V2 Flash cost?

    Check the pricing panel on this page for today's numbers. AI Gateway tracks rates across every provider that serves MiMo V2 Flash.

  • How does MiMo V2 Flash compare to DeepSeek-V3?

    DeepSeek-V3 is larger on both counts: 671B total parameters with 37B active per token, versus 309B total and 15B active for MiMo V2 Flash. Both are MoE models with different size and training choices, so compare the published tables on each vendor's page.
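
The propose-and-verify idea from the multi-token prediction answer above can be sketched as a toy greedy speculative decoder. This is a generic illustration, not Xiaomi's implementation: the `draft` function stands in for the MTP block, `target` stands in for the full model, and both operate on made-up integer tokens.

```typescript
type NextToken = (context: number[]) => number

// Greedy speculative decoding: the draft proposes draftLen tokens, the target
// keeps the longest matching prefix and supplies one token of its own.
function speculativeDecode(target: NextToken, draft: NextToken, prompt: number[], newTokens: number, draftLen: number) {
  const out = [...prompt]
  const goal = prompt.length + newTokens
  let targetPasses = 0
  while (out.length < goal) {
    // 1. Draft proposes tokens autoregressively.
    const proposed: number[] = []
    for (let i = 0; i < draftLen; i++) proposed.push(draft([...out, ...proposed]))
    // 2. Target checks the proposals; a real system does this in one batched pass.
    targetPasses++
    for (let i = 0; i <= proposed.length && out.length < goal; i++) {
      const expected = target(out)
      out.push(expected) // always emit the target's own token
      // Stop at the first mismatch, or after the bonus token past the draft.
      if (i === proposed.length || proposed[i] !== expected) break
    }
  }
  return { tokens: out.slice(prompt.length), targetPasses }
}

// Hypothetical models: the target counts upward mod 10; the draft errs after a 4.
const target: NextToken = (ctx) => (ctx[ctx.length - 1] + 1) % 10
const draft: NextToken = (ctx) => (ctx[ctx.length - 1] === 4 ? 0 : (ctx[ctx.length - 1] + 1) % 10)

const decoded = speculativeDecode(target, draft, [0], 8, 3)
console.log(decoded.tokens, decoded.targetPasses)
```

Because every emitted token comes from the target, the output matches plain greedy decoding; the gain is that eight tokens here need three verification passes instead of eight.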