LongCat Flash Chat
LongCat Flash Chat is Meituan's 560B Mixture-of-Experts (MoE) conversational model that activates roughly 27B parameters per token on average. It targets high-throughput agentic tool use and complex multi-step interactions under an MIT license.
```ts
import { streamText } from 'ai'

const result = streamText({
  model: 'meituan/longcat-flash-chat',
  prompt: 'Why is the sky blue?',
})

// Consume the stream as it arrives.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}
```

Frequently Asked Questions
How does the zero-computation expert gating work in LongCat Flash Chat?
The MoE router evaluates each input token and activates only the most relevant expert networks, selecting between 18.6B and 31.3B parameters (roughly 27B on average) from the 560B total. The "zero-computation" label refers to identity experts in the pool: they pass a token through unchanged, so tokens the router judges less demanding consume fewer FLOPs and per-token compute varies with input difficulty.
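The routing idea can be sketched in a few lines. This is an illustrative toy, not Meituan's implementation: the expert functions, the scoring heuristic, and the threshold are invented for demonstration. The point is that a token routed to a zero-computation (identity) expert contributes no extra compute.

```ts
// Toy sketch of MoE routing with a zero-computation expert.
// All names, sizes, and thresholds are illustrative, not from LongCat Flash.
type Expert = (x: number[]) => number[]

// A real expert does work; a zero-computation expert is the identity.
const heavyExpert: Expert = (x) => x.map((v) => v * 2) // stands in for an FFN
const zeroComputeExpert: Expert = (x) => x             // identity: no FLOPs spent

// Router: choose one expert per token from a score (a trivial heuristic here).
function route(token: number[]): number[] {
  const score = token.reduce((a, b) => a + Math.abs(b), 0)
  // "Easy" tokens (low score) go to the identity expert and cost nothing extra.
  const chosen = score < 1 ? zeroComputeExpert : heavyExpert
  return chosen(token)
}

route([0.1, 0.2]) // routed to the zero-computation expert, returned unchanged
route([3, 4])     // routed to the heavy expert
```

In the real model the router is learned and experts are transformer FFN blocks, but the budget effect is the same: activated parameters per token land anywhere in the 18.6B–31.3B range depending on how many tokens take the identity path.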
What throughput does LongCat Flash Chat sustain in practice?
MoE dynamic activation reduces per-token compute versus a dense model of equivalent parameter count. Live throughput metrics appear on this page.
What distinguishes Flash Chat from Flash Thinking for tool-calling workflows?
Flash Chat invokes tools and replies immediately without extended internal deliberation. Flash Thinking generates reasoning chains before responding, which improves accuracy on complex tasks but increases latency and token cost. Choose Flash Chat for high-frequency tool calling where response speed is the priority.
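Mechanically, "invokes tools and replies immediately" means the model emits a tool call, the host executes it, and the result is fed straight back for the final answer, with no intermediate reasoning trace. A minimal sketch of that dispatch step, with a hypothetical weather tool invented for illustration:

```ts
// Minimal tool-dispatch sketch of the Flash Chat pattern.
// The tool name, its arguments, and the registry shape are hypothetical.
type ToolCall = { name: string; args: Record<string, unknown> }

const tools: Record<string, (args: Record<string, unknown>) => string> = {
  // Hypothetical tool used purely for illustration.
  getWeather: (args) => `Sunny in ${args.city}`,
}

function dispatch(call: ToolCall): string {
  const tool = tools[call.name]
  if (!tool) throw new Error(`unknown tool: ${call.name}`)
  return tool(call.args)
}

// A model response requesting a tool call would be handled like this:
const toolResult = dispatch({ name: 'getWeather', args: { city: 'Paris' } })
```

With the AI SDK shown above, the equivalent is passing a `tools` option to `streamText`, which runs this loop for you; the sketch just makes the dispatch step explicit.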
What is the context window for LongCat Flash Chat?
LongCat Flash Chat supports a 128K-token context window, with up to 100K tokens per request. This accommodates long conversation histories, multi-document contexts, and extended agentic session transcripts in a single call.
Which benchmarks reflect Flash Chat's strengths?
Flash Chat targets agentic tool use and instruction following at high throughput. For reasoning benchmarks like ARC-AGI, formal proof, and advanced STEM, the Flash Thinking variant covers those capabilities.