LongCat Flash Chat
LongCat Flash Chat is Meituan's 560B Mixture-of-Experts (MoE) conversational model that activates roughly 27B parameters per token on average. It targets high-throughput agentic tool use and complex multi-step interactions under an MIT license.
import { streamText } from 'ai'
const result = streamText({
  model: 'meituan/longcat-flash-chat',
  prompt: 'Why is the sky blue?',
})
Playground
Try out LongCat Flash Chat by Meituan. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.
P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.
Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.
About LongCat Flash Chat
LongCat Flash Chat is the direct-response conversational variant of Meituan's LongCat-Flash series. It answers immediately rather than generating extended internal reasoning chains. The architecture is a Mixture-of-Experts design with zero-computation experts that activates 18.6B to 31.3B parameters per token (roughly 27B on average) out of the 560B total. Per-token compute therefore tracks the active parameter count rather than the full model size; see pricing for current rates. The full 560B parameters still contribute to knowledge and generalization, because different tokens are routed to different experts.
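To put the activation figures in proportion, the short calculation below expresses each quoted active-parameter count as a share of the 560B total. It uses only the numbers stated above; the helper name `activeSharePercent` is made up for this sketch.

```typescript
// Figures quoted in the text above, in billions of parameters.
const TOTAL_PARAMS_B = 560;
const ACTIVE_PARAMS_B = { min: 18.6, average: 27, max: 31.3 };

// Share of the total parameter count active for one token,
// as a percentage rounded to one decimal place.
function activeSharePercent(activeB: number, totalB: number = TOTAL_PARAMS_B): number {
  return Math.round((activeB / totalB) * 1000) / 10;
}

const activeShare = {
  min: activeSharePercent(ACTIVE_PARAMS_B.min),
  average: activeSharePercent(ACTIVE_PARAMS_B.average),
  max: activeSharePercent(ACTIVE_PARAMS_B.max),
};

console.log(activeShare); // { min: 3.3, average: 4.8, max: 5.6 }
```

On these figures, roughly 3% to 6% of the parameters take part in processing any single token.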
The design emphasizes agentic tool use and reliable instruction following across sequential steps. In practice, the model handles structured function calls consistently across many turns, maintains task state through long tool-augmented conversations, and keeps response formatting stable. These properties matter when an agent invokes tools dozens of times in a session without behavior drift.
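The kind of session described above can be pictured as a loop: ask the model, run the tool it requests, feed the result back, and repeat until it answers in text. The sketch below is a minimal, self-contained illustration with a stubbed model. The types `ToolCall`, `ModelTurn`, and `Tool`, the function `runToolLoop`, and the `increment` tool are all hypothetical and are not part of the AI SDK or AI Gateway.

```typescript
// Hypothetical shapes for this sketch; they are not the AI SDK's types.
type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelTurn = { toolCall?: ToolCall; text?: string };
type Tool = (args: Record<string, unknown>) => string;

// Ask the model, execute the requested tool, append the result to the transcript,
// and stop when the model answers in text or the step cap is reached.
function runToolLoop(
  model: (transcript: string[]) => ModelTurn,
  tools: Record<string, Tool>,
  prompt: string,
  maxSteps = 32,
): { answer: string | null; steps: number; transcript: string[] } {
  const transcript = [`user: ${prompt}`];
  for (let step = 1; step <= maxSteps; step++) {
    const turn = model(transcript);
    if (!turn.toolCall) {
      const answer = turn.text ?? '';
      transcript.push(`assistant: ${answer}`);
      return { answer, steps: step, transcript };
    }
    const { tool, args } = turn.toolCall;
    const impl: Tool | undefined = tools[tool];
    // Unknown tool names are reported back to the model instead of throwing.
    const result = impl ? impl(args) : `error: unknown tool "${tool}"`;
    transcript.push(`tool(${tool}): ${result}`);
  }
  // Step cap reached without a final text answer.
  return { answer: null, steps: maxSteps, transcript };
}

// Stub model: requests the tool three times, then answers.
const stubModel = (transcript: string[]): ModelTurn => {
  const toolResults = transcript.filter((line) => line.startsWith('tool(')).length;
  return toolResults < 3
    ? { toolCall: { tool: 'increment', args: { value: toolResults } } }
    : { text: `done after ${toolResults} tool calls` };
};

const demoTools: Record<string, Tool> = {
  increment: (args) => String(Number(args.value) + 1),
};

const loopResult = runToolLoop(stubModel, demoTools, 'count to three');
console.log(loopResult.answer, loopResult.steps);
```

The step cap is a design choice for the sketch: it bounds cost if a session never converges on a final answer.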
Meituan released LongCat Flash Chat under an MIT license. Model weights are publicly available; see the upstream listing. Access it through AI Gateway with one API key.
What To Consider When Choosing a Provider
- Configuration: Dynamic activation scales per-token compute with input complexity. This can produce variable response latency depending on context density. Account for this when you set timeouts in agentic pipeline configurations.
- Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
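One way to account for variable latency when setting timeouts is to derive a per-step budget from expected time to first token and throughput, then add headroom. The sketch below is an assumption-laden illustration: `stepTimeoutMs`, the `LatencyInputs` fields, and the example numbers are hypothetical, and real values should come from the live metrics on this page.

```typescript
// All inputs are hypothetical placeholders; read real P50 TTFT and throughput
// from the live metrics rather than from these example values.
interface LatencyInputs {
  ttftMs: number;          // expected time to first token, in milliseconds
  tokensPerSecond: number; // expected output throughput
  maxOutputTokens: number; // cap on tokens generated for this step
  safetyFactor: number;    // headroom for slower, dense-context requests
}

// Timeout budget for one model call: (TTFT + generation time) scaled by the headroom.
function stepTimeoutMs(inputs: LatencyInputs): number {
  if (inputs.tokensPerSecond <= 0) {
    throw new RangeError('tokensPerSecond must be positive');
  }
  const generationMs = (inputs.maxOutputTokens / inputs.tokensPerSecond) * 1000;
  return Math.ceil((inputs.ttftMs + generationMs) * inputs.safetyFactor);
}

const exampleTimeoutMs = stepTimeoutMs({
  ttftMs: 500,
  tokensPerSecond: 100,
  maxOutputTokens: 1000,
  safetyFactor: 2,
});

console.log(exampleTimeoutMs); // 21000
```

The resulting value could be turned into a signal with the standard `AbortSignal.timeout(ms)` and handed to the request, assuming the client you use accepts an abort signal.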
When to Use LongCat Flash Chat
Best For
- Agentic tool invocation: Fast, reliable tool calls across many sequential steps with consistent function-calling behavior
- High-volume chat: Instruction-following applications where fast inference matters (see live metrics on this page)
- Long multi-turn sessions: Tool-augmented conversations within the context window of 128K tokens
- Cost-sensitive deployments: Mixture-of-Experts (MoE) dynamic activation keeps per-token cost competitive despite the 560B parameter count
Consider Alternatives When
- Deliberative reasoning needs: Tasks require extended reasoning or formal mathematical proof (LongCat Flash Thinking targets those)
- Peak STEM benchmarks: Maximum benchmark performance on complex STEM reasoning outweighs throughput and cost
- Multimodal input required: You need image, audio, or video input alongside text
Conclusion
LongCat Flash Chat delivers the conversational and agentic throughput of a 560B parameter model at roughly the cost of 27B active parameters per token. See live metrics on this page. Its context window of 128K tokens and reliable tool-calling behavior make it a usable foundation for high-throughput agentic products.
Frequently Asked Questions
How does the zero-computation expert gating work in LongCat Flash Chat?
The MoE router scores the available experts for each token and activates only the highest-scoring ones. Some experts in the pool are zero-computation experts that return their input unchanged, so a token routed to them consumes no expert compute. As a result, the number of active parameters varies per token, from 18.6B to 31.3B (roughly 27B on average) out of the 560B total.
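As a rough illustration of how per-token activation can vary, the toy sketch below follows the LongCat-Flash design as publicly described, in which the expert pool mixes ordinary feed-forward experts with identity ("zero-computation") experts. The pool size, scores, and parameter counts are invented for the example, and plain top-k selection stands in for the real router, whose details are not given here.

```typescript
// Toy sizes chosen for illustration only; not the model's real configuration.
type Expert = { id: number; kind: 'ffn' | 'zero'; paramsB: number };

const experts: Expert[] = [
  { id: 0, kind: 'ffn', paramsB: 1 },
  { id: 1, kind: 'ffn', paramsB: 1 },
  { id: 2, kind: 'ffn', paramsB: 1 },
  { id: 3, kind: 'zero', paramsB: 0 }, // identity: returns its input unchanged
  { id: 4, kind: 'zero', paramsB: 0 },
];

// Pick the k highest-scoring experts for one token (plain top-k routing).
// `scores` must contain one router score per entry in `experts`.
function routeTopK(scores: number[], k: number): Expert[] {
  return scores
    .map((score, id) => ({ score, id }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ id }) => experts[id]);
}

// Parameters exercised for the token: zero-computation experts contribute nothing.
function activeParamsB(selected: Expert[]): number {
  return selected.reduce((sum, expert) => sum + expert.paramsB, 0);
}

const hardToken = routeTopK([0.9, 0.8, 0.7, 0.1, 0.2], 3); // three FFN experts
const easyToken = routeTopK([0.9, 0.1, 0.2, 0.8, 0.7], 3); // one FFN, two zero experts

console.log(activeParamsB(hardToken), activeParamsB(easyToken)); // 3 1
```

Both tokens select the same number of experts, yet the second exercises a third of the parameters, which is how a fixed top-k can still yield a variable active-parameter count.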
What throughput does LongCat Flash Chat sustain in practice?
MoE dynamic activation reduces per-token compute versus a dense model of equivalent parameter count. Live throughput metrics appear on this page.
What distinguishes Flash Chat from Flash Thinking for tool-calling workflows?
Flash Chat invokes tools and replies immediately without extended internal deliberation. Flash Thinking generates reasoning chains before responding, which improves accuracy on complex tasks but increases latency and token cost. Choose Flash Chat for high-frequency tool calling where response speed is the priority.
What is the context window for LongCat Flash Chat?
It supports a context window of 128K tokens, up to 100K tokens per request. This accommodates long conversation histories, multi-document contexts, and extended agentic session transcripts in a single call.
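For long agentic sessions, a client typically has to keep the transcript within the window. The sketch below shows one possible approach: keep system messages and the newest turns that fit a token budget. The 4-characters-per-token estimate, the 128,000-token reading of "128K", and the output reserve are assumptions for illustration, not properties of the model's tokenizer or of AI Gateway.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string };

// Assumptions for this sketch: "128K" read as 128,000 tokens, and a hypothetical
// reserve left free for the model's reply.
const CONTEXT_WINDOW_TOKENS = 128_000;
const RESERVED_FOR_OUTPUT_TOKENS = 8_000;

// Rough heuristic (about 4 characters per token); not the model's tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keeps all system messages plus the newest other messages that fit in the budget.
function trimToBudget(messages: ChatMessage[], budgetTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === 'system');
  let remaining = budgetTokens - system.reduce((n, m) => n + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];
  // Walk backwards so the most recent turns are considered first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message.role === 'system') continue;
    const cost = estimateTokens(message.content);
    if (cost > remaining) break; // stop at the first turn that no longer fits
    remaining -= cost;
    kept.unshift(message);
  }
  return [...system, ...kept];
}

const inputBudget = CONTEXT_WINDOW_TOKENS - RESERVED_FOR_OUTPUT_TOKENS;
const history: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful agent.' },
  { role: 'user', content: 'Look up order 17.' },
];
console.log(trimToBudget(history, inputBudget).length); // 2
```

Stopping at the first message that does not fit keeps the retained history contiguous, so the model never sees a conversation with a gap in the middle.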
Which benchmarks reflect Flash Chat's strengths?
Flash Chat targets agentic tool use and instruction following at high throughput, so benchmarks in those areas best reflect its strengths. Reasoning-heavy benchmarks such as ARC-AGI, formal proof, and advanced STEM are the focus of the Flash Thinking variant.