LongCat Flash Chat

meituan/longcat-flash-chat

LongCat Flash Chat is Meituan's 560B-parameter Mixture-of-Experts (MoE) conversational model, activating about 27B parameters per token on average. It targets high-throughput agentic tool use and complex multi-step interactions, and is released under the MIT license.

Tool Use
index.ts

import { streamText } from 'ai'

const result = streamText({
  model: 'meituan/longcat-flash-chat',
  prompt: 'Why is the sky blue?',
})

// Print the response as it streams in.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}
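Since this section is about tool use, here is a minimal tool-calling sketch building on the example above. It assumes AI SDK 5 (the tool helper with an inputSchema field, and the fullStream part types) plus Zod for schemas; getWeather is a hypothetical tool added for illustration, not a built-in.

import { streamText, tool } from 'ai'
import { z } from 'zod'

const result = streamText({
  model: 'meituan/longcat-flash-chat',
  tools: {
    // Hypothetical tool for illustration; a real app would call a weather API.
    getWeather: tool({
      description: 'Get the current weather for a city',
      inputSchema: z.object({ city: z.string() }),
      execute: async ({ city }) => ({ city, tempC: 21 }),
    }),
  },
  prompt: 'What is the weather in Paris right now?',
})

// Tool calls surface in the full stream alongside text deltas.
for await (const part of result.fullStream) {
  if (part.type === 'tool-call') console.log('calling', part.toolName)
}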

What To Consider When Choosing a Provider

  • Zero Data Retention

    AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.

  • Authentication

    AI Gateway authenticates requests using an API key or OIDC token, so you do not need to manage provider credentials directly (a configuration sketch follows this list).
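As a sketch of that configuration, assuming the @ai-sdk/gateway provider package and its AI_GATEWAY_API_KEY environment variable (verify both against the current Gateway docs):

import { createGateway } from '@ai-sdk/gateway'
import { generateText } from 'ai'

// Assumption: when no explicit apiKey is passed, the provider picks up
// AI_GATEWAY_API_KEY (or a Vercel OIDC token) automatically.
const gateway = createGateway({ apiKey: process.env.AI_GATEWAY_API_KEY })

const { text } = await generateText({
  model: gateway('meituan/longcat-flash-chat'),
  prompt: 'Hello from the gateway',
})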

Dynamic activation scales per-token compute with input complexity. This can produce variable response latency depending on context density. Account for this when you set timeouts in agentic pipeline configurations.
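One way to express such a timeout is the AI SDK's abortSignal option with a standard AbortSignal.timeout (Node 17.3+). The 60-second budget below is an arbitrary assumption for illustration, not a recommendation:

import { generateText } from 'ai'

const { text } = await generateText({
  model: 'meituan/longcat-flash-chat',
  prompt: 'Summarize this long transcript...',
  // Assumption: 60s leaves headroom for dense contexts; tune per workload.
  abortSignal: AbortSignal.timeout(60_000),
})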

When to Use LongCat Flash Chat

Best For

  • Agentic tool invocation:

    Fast, reliable tool calls across many sequential steps with consistent function-calling behavior (see the multi-step sketch after this list)

  • High-volume chat:

    Instruction-following applications where fast inference matters (see live metrics on this page)

  • Long multi-turn sessions:

    Tool-augmented conversations that fit within the 128K-token context window

  • Cost-sensitive deployments:

    Mixture-of-Experts (MoE) dynamic activation keeps per-token cost competitive despite the 560B parameter count
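As referenced above, a multi-step agentic loop might look like the following sketch. It assumes AI SDK 5 (the generateText steps result and the stepCountIs stop condition); searchOrders is a hypothetical tool.

import { generateText, tool, stepCountIs } from 'ai'
import { z } from 'zod'

const { text, steps } = await generateText({
  model: 'meituan/longcat-flash-chat',
  tools: {
    // Hypothetical tool for illustration.
    searchOrders: tool({
      description: 'Search orders by customer name',
      inputSchema: z.object({ customer: z.string() }),
      execute: async ({ customer }) => [{ id: 'A1', customer, status: 'shipped' }],
    }),
  },
  // Let the model call tools across up to 5 sequential steps before answering.
  stopWhen: stepCountIs(5),
  prompt: 'Has the latest order for Alice shipped?',
})

console.log(text, `(${steps.length} steps)`)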

Consider Alternatives When

  • Deliberative reasoning needs:

    Tasks require extended reasoning or formal mathematical proof (LongCat Flash Thinking targets those)

  • Peak STEM benchmarks:

    Maximum benchmark performance on complex STEM reasoning outweighs throughput and cost

  • Multimodal input required:

    You need image, audio, or video input alongside text

Conclusion

LongCat Flash Chat delivers the conversational and agentic throughput of a 560B-parameter model at roughly the cost of its ~27B active parameters per token (live throughput and latency metrics appear on this page). Its 128K-token context window and reliable tool-calling behavior make it a practical foundation for high-throughput agentic products.

FAQ

How does the dynamic expert activation work?

The MoE gating mechanism evaluates each input token and activates only the most relevant expert networks, selecting between 18.6B and 31.3B parameters (roughly 27B on average) from the 560B total. The "zero-computation" label means the routing decision itself adds no additional inference cost.
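For intuition only, here is a toy top-k router in TypeScript. This is a generic MoE gating sketch, not LongCat's actual routing code, and the gate logits are made up:

function routeTopK(gateLogits: number[], k: number): number[] {
  // Rank experts by gate score and keep the k most relevant indices.
  return gateLogits
    .map((logit, expert) => ({ logit, expert }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k)
    .map(({ expert }) => expert)
}

// Which experts fire (and therefore how many parameters are touched)
// changes token by token with the gate scores.
console.log(routeTopK([0.2, 1.7, -0.4, 0.9], 2)) // → [1, 3]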

How does MoE dynamic activation affect cost and throughput?

MoE dynamic activation reduces per-token compute versus a dense model of equivalent parameter count. Live throughput metrics appear on this page.

How does Flash Chat differ from LongCat Flash Thinking?

Flash Chat invokes tools and replies immediately without extended internal deliberation. Flash Thinking generates reasoning chains before responding, which improves accuracy on complex tasks but increases latency and token cost. Choose Flash Chat for high-frequency tool calling where response speed is the priority.

How large is the context window?

It supports a 128K-token context window, with up to 100K tokens per request. This accommodates long conversation histories, multi-document contexts, and extended agentic session transcripts in a single call.

Can it handle advanced reasoning benchmarks?

Flash Chat targets agentic tool use and instruction following at high throughput. For reasoning benchmarks like ARC-AGI, formal proofs, and advanced STEM, the Flash Thinking variant covers those capabilities.