Qwen3 Next 80B A3B Instruct
Qwen3 Next 80B A3B Instruct is an 80-billion-parameter hybrid Transformer-Mamba model that activates only about 3B parameters per token. It delivers roughly 10x the inference throughput of a comparable 32B dense model on sequences of 32K tokens or longer, with a native context window of 262.1K tokens.
import { streamText } from 'ai'
const result = streamText({ model: 'alibaba/qwen3-next-80b-a3b-instruct', prompt: 'Why is the sky blue?' })
Playground
Try out Qwen3 Next 80B A3B Instruct by Alibaba. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
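A provider preference can be expressed as an ordered list of slugs. The sketch below only builds the options object and does not call the gateway. It assumes the AI SDK's `providerOptions.gateway.order` shape, and `provider-a` and `provider-b` are hypothetical placeholder slugs, not real providers for this model. Check the docs for the exact option names before relying on it.

```typescript
// Sketch under stated assumptions: the option shape { gateway: { order } } is
// assumed from the AI SDK gateway provider options, and the slugs used below
// are hypothetical placeholders.
type GatewayProviderOptions = { gateway: { order: string[] } };

function preferProviders(slugs: string[]): GatewayProviderOptions {
  // Trim whitespace, drop empty entries, and remove duplicates while
  // keeping the caller's priority order (first occurrence wins).
  const order = slugs
    .map((slug) => slug.trim())
    .filter((slug) => slug.length > 0)
    .filter((slug, index, all) => all.indexOf(slug) === index);
  return { gateway: { order } };
}

const providerOptions = preferProviders(['provider-a', ' provider-b', 'provider-a']);
console.log(JSON.stringify(providerOptions));
```

The resulting object would be passed as the `providerOptions` argument of `streamText`, alongside `model` and `prompt`.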
About Qwen3 Next 80B A3B Instruct
Qwen3 Next 80B A3B Instruct introduces a Hybrid Transformer-Mamba architecture that alternates between Gated DeltaNet (a linear attention mechanism) and standard Gated Attention within a 48-layer, 512-expert MoE stack. The layers are arranged as a four-layer block repeated 12 times: three Gated DeltaNet + MoE layers followed by one Gated Attention + MoE layer. This design is purpose-built for ultra-long-context efficiency: linear attention handles 36 of the 48 layers at sub-quadratic cost, while the 12 interleaved Gated Attention layers maintain the precision needed for complex cross-token reasoning.
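The layer layout can be enumerated to check the counts. This is an illustrative sketch, not the model's implementation: it assumes the pattern is a four-layer block (three Gated DeltaNet + MoE layers, then one Gated Attention + MoE layer) repeated 12 times, which is what 48 layers implies. The type and function names are made up for the example.

```typescript
// Illustrative only: labels for the two layer types described in the text.
type LayerKind = 'gated-deltanet+moe' | 'gated-attention+moe';

// Build the stack as `blocks` repetitions of
// (`linearPerBlock` linear-attention layers, then one full-attention layer).
function buildLayout(blocks: number, linearPerBlock: number): LayerKind[] {
  const layers: LayerKind[] = [];
  for (let b = 0; b < blocks; b++) {
    for (let i = 0; i < linearPerBlock; i++) {
      layers.push('gated-deltanet+moe');
    }
    layers.push('gated-attention+moe');
  }
  return layers;
}

const layout = buildLayout(12, 3);
const linearCount = layout.filter((kind) => kind === 'gated-deltanet+moe').length;
const fullCount = layout.length - linearCount;

console.log(layout.length, linearCount, fullCount); // 48 36 12
```

Under this reading, three quarters of the layers use linear attention and every fourth layer uses Gated Attention.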
With only 10 of 512 routed experts activated per token (plus one shared expert), Qwen3 Next 80B A3B Instruct uses about 3B of its 80B parameters for each token, a 3.75% activation ratio. Combined with Multi-Token Prediction during inference, this translates to approximately 10x higher throughput than comparable 32B dense models on sequences of 32K tokens or longer, a meaningful operational advantage for workloads that process long documents or transcripts at scale. The Instruct variant is tuned for direct instruction following and doesn't generate thinking traces (that variant is Qwen3-Next-80B-A3B-Thinking).
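The 3.75% figure matches the ratio of active to total parameters (3B of 80B). That is a different quantity from the fraction of routed experts selected per token. A quick arithmetic check, using the rounded parameter counts from the model name:

```typescript
// Rounded figures taken from the model name and the text above.
const totalParams = 80e9; // "80B"
const activeParams = 3e9; // "A3B"
const routedExperts = 512;
const activeRoutedExperts = 10; // the shared expert is not counted here

// Format a fraction as a percentage string with two decimals.
function pct(fraction: number): string {
  return (fraction * 100).toFixed(2) + '%';
}

const paramRatio = activeParams / totalParams;
const expertRatio = activeRoutedExperts / routedExperts;

console.log(pct(paramRatio)); // 3.75%
console.log(pct(expertRatio)); // 1.95%
```

The two ratios differ because attention layers, embeddings, and the shared expert are active for every token regardless of routing.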
On the RULER long-context benchmark at 1M tokens, Qwen3 Next 80B A3B Instruct scores 80.3% accuracy, and its native context of 262.1K tokens is extensible to approximately one million tokens via YaRN RoPE scaling. On knowledge and chat-preference benchmarks, it scores 80.6 on MMLU-Pro and 82.7 on Arena-Hard v2 respectively, tracking competitively with models that require far more compute per token.
What To Consider When Choosing a Provider
- Configuration: Providers vary in their support for ultra-long context windows; confirm that your selected provider can handle requests approaching 262.1K tokens before deploying at scale.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
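For the Configuration point above, a rough pre-flight check can flag requests that approach the native context limit before they are sent. This is a minimal sketch with two loud assumptions: the 262.1K limit is taken as 262,144 tokens (2^18), and token counts are estimated with a crude 4-characters-per-token heuristic rather than the model's real tokenizer. The function names are hypothetical.

```typescript
// ASSUMPTION: "262.1K" is taken as 262,144 (2^18) tokens.
const NATIVE_CONTEXT_TOKENS = 262144;

// ASSUMPTION: ~4 characters per token. This is a rough heuristic for English
// text, not the model's tokenizer; use a real tokenizer for exact counts.
function estimateTokens(text: string, charsPerToken: number = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// True when the estimated prompt size plus the tokens reserved for the
// response stays within the context limit.
function fitsContext(
  prompt: string,
  reservedOutputTokens: number,
  limit: number = NATIVE_CONTEXT_TOKENS,
): boolean {
  return estimateTokens(prompt) + reservedOutputTokens <= limit;
}

console.log(fitsContext('a'.repeat(400000), 4096)); // true  (about 100,000 + 4,096 tokens)
console.log(fitsContext('a'.repeat(1200000), 4096)); // false (about 300,000 + 4,096 tokens)
```

A provider may enforce a lower limit than the model's native window, so the `limit` argument is left adjustable.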
When to Use Qwen3 Next 80B A3B Instruct
Best For
- High-throughput document pipelines: Workloads where long context and inference speed must coexist
- Long-document professional workloads: Legal, financial, or research tasks that regularly process 100K+ token documents without needing reasoning traces
- Long-session conversational assistants: Summarization and chat services that require reliable instruction following across very long sessions
- Multi-document synthesis: Tasks where entire reports must remain in-context simultaneously
- Cost-sensitive deployments: Production traffic that can't afford dense model pricing at high query volume
Consider Alternatives When
- Explicit chain-of-thought needed: Hard math or coding problems are better served by Qwen3-Next-80B-A3B-Thinking
- Short-context exchanges: The architectural advantages for long sequences don't apply to short prompts
- Multimodal input required: This model processes text only; use a vision-language model for images or video
- Narrow-task accuracy ceiling: When maximum accuracy on one specific task matters more than throughput, a model that scores higher on that task may be the better choice
Conclusion
Qwen3 Next 80B A3B Instruct fits production deployments that combine long context with high throughput requirements. Its Hybrid Transformer-Mamba architecture delivers the efficiency of linear attention while retaining the precision of periodic full-attention layers, making it practical for document-scale workloads that would otherwise require expensive dense models.
Frequently Asked Questions
What does "80B-A3B" mean in this model's name?
"80B" refers to 80 billion total parameters in the MoE pool; "A3B" indicates that approximately 3 billion parameters are activated per token, a 3.75% activation ratio. Only 10 of 512 routed experts (plus one shared expert) fire for each token.
How does Hybrid Transformer-Mamba architecture affect performance on long contexts?
The architecture alternates Gated DeltaNet (linear attention) with sparse Gated Attention. Linear attention scales sub-quadratically with sequence length, enabling the model to process sequences of 262.1K tokens with significantly lower compute than a fully quadratic attention model.
What is the throughput advantage over a dense model?
On sequences of 32K tokens or longer, Qwen3 Next 80B A3B Instruct achieves approximately 10x higher throughput than a comparable Qwen3-32B dense model, according to the model's published technical specifications.
Does this model support a thinking or reasoning mode?
No. The Instruct variant is optimized for direct instruction following without thinking traces. The Qwen3-Next-80B-A3B-Thinking variant provides reasoning mode.
What is the maximum context length?
The native context is 262.1K tokens. Using YaRN RoPE scaling, this can be extended to approximately one million tokens. The model achieves 80.3% accuracy on the RULER long-context benchmark at 1M tokens.
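The scaling factor implied by these numbers can be worked out directly. The sketch below assumes a native window of 262,144 tokens and a target of one million. The `rope_type`, `factor`, and `original_max_position_embeddings` field names follow the common Hugging Face `rope_scaling` convention and are an assumption here. They would apply to self-hosted deployments; whether a hosted provider exposes them is not stated on this page.

```typescript
// ASSUMPTION: "262.1K" is taken as 262,144 (2^18) tokens.
const nativeContext = 262144;
const targetContext = 1000000; // "approximately one million tokens"

// Smallest whole-number factor that stretches the native window past the target.
const factor = Math.ceil(targetContext / nativeContext);
const extendedContext = nativeContext * factor;

// ASSUMPTION: field names follow the Hugging Face rope_scaling convention;
// they are illustrative, not confirmed by this page.
const ropeScaling = {
  rope_type: 'yarn',
  factor: factor,
  original_max_position_embeddings: nativeContext,
};

console.log(factor, extendedContext); // 4 1048576
console.log(JSON.stringify(ropeScaling));
```

A factor of 4 gives 1,048,576 positions, which is consistent with the "approximately one million" figure.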
What benchmarks has this model been evaluated on?
Key scores include MMLU-Pro (80.6), MMLU-Redux (90.9), GPQA (72.9), AIME25 (69.5), LiveCodeBench (56.6), Arena-Hard v2 (82.7), and BFCL-v3 (70.3) for function calling.
Is Multi-Token Prediction supported?
Yes. Multi-Token Prediction further accelerates inference beyond the baseline throughput gains from the sparse MoE architecture.