LongCat Flash Chat

LongCat Flash Chat is Meituan's 560B Mixture-of-Experts (MoE) conversational model that activates roughly 27B parameters per token on average. It targets high-throughput agentic tool use and complex multi-step interactions under an MIT license.

Tool Use

index.ts

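import { streamText } from 'ai'

// Start a streaming request to LongCat Flash Chat.
const result = streamText({
  model: 'meituan/longcat-flash-chat',
  prompt: 'Why is the sky blue?'
})

// Print the response text as chunks arrive.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}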

Playground

Try out LongCat Flash Chat by Meituan. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider

Context	Latency	Throughput	Release Date
128K	2.1s	102tps	08/30/2025

Legal: Terms • Privacy
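
As a rough illustration of how the latency and throughput figures in the provider row combine, the sketch below estimates end-to-end response time for a given output length. It assumes the latency figure is time to first token and that throughput stays constant during generation; the function and constant names are hypothetical, and the numbers are a snapshot, so live metrics will differ.

```typescript
// Snapshot values from the provider row above; live metrics will differ.
const LATENCY_SECONDS = 2.1; // assumed to mean time to first token
const THROUGHPUT_TPS = 102; // output tokens per second

/** Estimated seconds until the last output token arrives. */
function estimateResponseSeconds(
  outputTokens: number,
  latencySeconds: number = LATENCY_SECONDS,
  tokensPerSecond: number = THROUGHPUT_TPS,
): number {
  return latencySeconds + outputTokens / tokensPerSecond;
}

// A 500-token reply: 2.1 + 500 / 102, about 7.0 seconds.
console.log(`${estimateResponseSeconds(500).toFixed(1)} s`);
```

An estimate like this can serve as a starting point for per-step time budgets in an agent loop.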

More models by Meituan

Model

Context	Latency	Throughput
33K	2.5s	47tps

About LongCat Flash Chat

LongCat Flash Chat is the direct-response conversational variant of Meituan's LongCat-Flash series. It answers immediately rather than generating extended internal reasoning chains. The architecture uses a gating mechanism with zero-computation experts, which pass a token through unchanged, so the model activates between 18.6B and 31.3B parameters per token (roughly 27B on average) out of 560B total. Per-token compute therefore tracks the active parameter count rather than the full model size; see the listed pricing for current rates. The full 560B parameters still contribute to the model's knowledge and generalization.
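
To put the activation figures in proportion, here is a small arithmetic sketch using only the numbers quoted above. The constant and function names are illustrative, not part of any API.

```typescript
// Figures quoted in the text above, in billions of parameters.
const TOTAL_PARAMS_B = 560;
const ACTIVE_MIN_B = 18.6;
const ACTIVE_AVG_B = 27;
const ACTIVE_MAX_B = 31.3;

/** Share of the total parameters that are active for one token, as a percentage. */
function activeSharePercent(activeB: number, totalB: number = TOTAL_PARAMS_B): number {
  return (activeB / totalB) * 100;
}

console.log(`min: ${activeSharePercent(ACTIVE_MIN_B).toFixed(1)}%`); // min: 3.3%
console.log(`avg: ${activeSharePercent(ACTIVE_AVG_B).toFixed(1)}%`); // avg: 4.8%
console.log(`max: ${activeSharePercent(ACTIVE_MAX_B).toFixed(1)}%`); // max: 5.6%
```

On these figures, each token touches roughly one twentieth of the model's parameters.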

The design emphasizes agentic tool use and reliable instruction following across sequential steps. In practice, the model handles structured function calls consistently across many turns, maintains task state through long tool-augmented conversations, and keeps response formatting stable. These properties matter when an agent invokes tools dozens of times in a session without behavior drift.

Meituan released LongCat Flash Chat under an MIT license. Model weights are publicly available; see the upstream listing. Access it through AI Gateway with one API key.

What To Consider When Choosing a Provider

Configuration: Dynamic activation scales per-token compute with input complexity. This can produce variable response latency depending on context density. Account for this when you set timeouts in agentic pipeline configurations.
Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
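
Because response latency can vary, an agentic pipeline may want an explicit deadline around each model call. The following is a minimal sketch of one way to do that with standard Promise and AbortController APIs; `withTimeout` is a hypothetical helper, not part of AI Gateway or any SDK, and the right timeout value depends on your workload.

```typescript
/**
 * Runs an async task with a deadline. The task receives an AbortSignal so it
 * can cancel in-flight work (for example, by passing the signal to fetch).
 */
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects once the deadline passes and signals the task to stop.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`Timed out after ${timeoutMs} ms`));
    }, timeoutMs);
  });

  try {
    // Whichever settles first wins: the task result or the deadline.
    return await Promise.race([task(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would wrap each step, for example `withTimeout((signal) => callModel(signal), 30_000)`, where `callModel` and the 30-second budget are placeholders for your own request function and limit.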

When to Use LongCat Flash Chat

Best For

Agentic tool invocation: Fast, reliable tool calls across many sequential steps with consistent function-calling behavior
High-volume chat: Instruction-following applications where fast inference matters (see live metrics on this page)
Long multi-turn sessions: Tool-augmented conversations within the context window of 128K tokens
Cost-sensitive deployments: Mixture-of-Experts (MoE) dynamic activation keeps per-token cost competitive despite the 560B parameter count
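
For long multi-turn sessions, one common way to stay inside a context window is to trim older turns before each request. The sketch below assumes a rough heuristic of about four characters per token, which is not the model's actual tokenizer; `ChatMessage`, `estimateTokens`, and `trimHistory` are hypothetical names for illustration.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Assumption: about 4 characters per token. A heuristic, not the real tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

/** Keeps all system messages plus the most recent other messages that fit the budget. */
function trimHistory(messages: ChatMessage[], budgetTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let used = system.reduce((sum, m) => sum + estimateTokens(m.content), 0);

  const kept: ChatMessage[] = [];
  // Walk from newest to oldest so recent turns are kept first.
  for (const message of [...rest].reverse()) {
    const cost = estimateTokens(message.content);
    if (used + cost > budgetTokens) break;
    used += cost;
    kept.unshift(message); // restore chronological order
  }
  return [...system, ...kept];
}
```

A call such as `trimHistory(history, 128_000 - 8_000)` would leave headroom for the reply; the 8,000-token reserve is an arbitrary example. A production version would also need to keep tool calls and their results together rather than cutting between them.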

Consider Alternatives When

Deliberative reasoning needs: Tasks require extended reasoning or formal mathematical proof (LongCat Flash Thinking targets those)
Peak STEM benchmarks: Maximum benchmark performance on complex STEM reasoning outweighs throughput and cost
Multimodal input required: You need image, audio, or video input alongside text

Conclusion

LongCat Flash Chat delivers the conversational and agentic capability of a 560B-parameter model at roughly the per-token compute of 27B active parameters. See live metrics on this page. Its context window of 128K tokens and reliable tool-calling behavior make it a usable foundation for high-throughput agentic products.

Frequently Asked Questions

How does the zero-computation expert gating work in LongCat Flash Chat?
The MoE router evaluates each input token and activates only the most relevant experts. Some of the experts it can select are zero-computation experts, which return their input unchanged, so a token routed to them consumes no extra compute. As a result, the model activates 18.6B to 31.3B parameters per token (roughly 27B on average) out of 560B total, depending on the token's context.

What throughput does LongCat Flash Chat sustain in practice?
MoE dynamic activation reduces per-token compute versus a dense model of equivalent parameter count. Live throughput metrics appear on this page.

What distinguishes Flash Chat from Flash Thinking for tool-calling workflows?
Flash Chat invokes tools and replies immediately without extended internal deliberation. Flash Thinking generates reasoning chains before responding, which improves accuracy on complex tasks but increases latency and token cost. Choose Flash Chat for high-frequency tool calling where response speed is the priority.

What is the context window for LongCat Flash Chat?
It supports a context window of 128K tokens, up to 100K tokens per request. This accommodates long conversation histories, multi-document contexts, and extended agentic session transcripts in a single call.

Which benchmarks reflect Flash Chat's strengths?
Flash Chat targets agentic tool use and instruction following at high throughput. Reasoning benchmarks such as ARC-AGI, formal proof, and advanced STEM are better served by the Flash Thinking variant.
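
As a toy illustration of how zero-computation experts can make per-token compute vary, the sketch below routes a token to its top-k experts, where some experts simply pass the input through. It assumes the design in which such experts return their input unchanged; the expert set, scores, and function names are invented for illustration, and this is not Meituan's implementation.

```typescript
type Vector = number[];
interface Expert {
  name: string;
  zeroCompute: boolean;
  apply: (x: Vector) => Vector;
}

// Toy "FFN" expert that scales its input. Real experts are feed-forward networks.
const ffn = (name: string, scale: number): Expert => ({
  name,
  zeroCompute: false,
  apply: (x) => x.map((v) => v * scale),
});

// Zero-computation expert: returns the input unchanged, so it does no work.
const identity = (name: string): Expert => ({ name, zeroCompute: true, apply: (x) => x });

const experts: Expert[] = [ffn("ffn-0", 2), ffn("ffn-1", 3), identity("zero-0"), identity("zero-1")];

/** Numerically stable softmax over router scores. */
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

/**
 * Routes one token to its top-k experts and mixes their outputs by gate weight.
 * Expects one router score per expert. Also reports how many of the selected
 * experts actually performed computation.
 */
function routeToken(x: Vector, routerScores: number[], k: number) {
  const gates = softmax(routerScores);
  const topK = experts
    .map((expert, index) => ({ expert, gate: gates[index] ?? 0 }))
    .sort((a, b) => b.gate - a.gate)
    .slice(0, k);

  const outputs = topK.map((s) => ({ gate: s.gate, y: s.expert.apply(x) }));
  const output = x.map((_, i) => outputs.reduce((sum, o) => sum + o.gate * (o.y[i] ?? 0), 0));
  const computedExperts = topK.filter((s) => !s.expert.zeroCompute).length;
  return { output, computedExperts, selected: topK.map((s) => s.expert.name) };
}
```

With `k = 2`, a token whose highest scores fall on the two FFN experts uses two computed experts, while one whose highest scores fall on the identity experts uses none. That variation is what lets the active parameter count differ from token to token.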
