Qwen3 Next 80B A3B Instruct
Qwen3 Next 80B A3B Instruct is an 80-billion-parameter hybrid Transformer-Mamba model that activates only 3B parameters per token, delivering roughly 10x the inference throughput of comparable dense models while supporting a native context window of 262.1K tokens.
import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen3-next-80b-a3b-instruct',
  prompt: 'Why is the sky blue?',
})

What To Consider When Choosing a Provider
Zero Data Retention
AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.

Authentication
AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
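For example, a minimal API-key sketch using the AI SDK's gateway provider. The @ai-sdk/gateway package, the createGateway options, and the AI_GATEWAY_API_KEY variable name are assumptions here; adapt them to your own configuration.

import { createGateway } from '@ai-sdk/gateway'
import { generateText } from 'ai'

// Authenticate with an API key read from the environment (OIDC can be used instead when deployed).
const gateway = createGateway({
  apiKey: process.env.AI_GATEWAY_API_KEY,
})

const { text } = await generateText({
  model: gateway('alibaba/qwen3-next-80b-a3b-instruct'),
  prompt: 'Why is the sky blue?',
})
console.log(text)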
Providers vary in their support for ultra-long context windows; confirm that your selected provider can handle requests approaching 262.1K tokens before deploying at scale.
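A rough pre-flight check can help here; the sketch below is only an approximation, since the 4-characters-per-token heuristic and the reserved output budget are assumptions rather than the model's actual tokenizer.

// Rough pre-flight check before sending a near-limit request.
const CONTEXT_LIMIT = 262_144
const OUTPUT_BUDGET = 4_096 // tokens reserved for the model's response (assumed)

function roughTokenCount(text: string): number {
  // Approximation: ~4 characters per token; use the real tokenizer for exact counts.
  return Math.ceil(text.length / 4)
}

function fitsInContext(documents: string[]): boolean {
  const promptTokens = documents.reduce((sum, doc) => sum + roughTokenCount(doc), 0)
  return promptTokens + OUTPUT_BUDGET <= CONTEXT_LIMIT
}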
When to Use Qwen3 Next 80B A3B Instruct
Best For
High-throughput document pipelines:
Workloads where long context and inference speed must coexist
Long-document professional workloads:
Legal, financial, or research tasks that regularly process 100K+ token documents without needing reasoning traces
Long-session conversational assistants:
Summarization and chat services that require reliable instruction following across very long sessions
Multi-document synthesis:
Tasks where entire reports must remain in-context simultaneously (see the sketch after this list)
Cost-sensitive deployments:
Production traffic that can't afford dense model pricing at high query volume
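As a concrete illustration of the multi-document synthesis case above, here is a minimal sketch; the report contents are placeholders and the system prompt is only an example.

import { streamText } from 'ai'

// Keep every report in-context at once and ask the model to reconcile them.
const reports = [
  { title: 'Q1 financial review', body: '...' },
  { title: 'Q2 financial review', body: '...' },
  { title: 'Auditor field notes', body: '...' },
]

const result = streamText({
  model: 'alibaba/qwen3-next-80b-a3b-instruct',
  system: 'Synthesize the findings across all provided reports. Cite the report title for each claim.',
  prompt: reports.map((r) => `## ${r.title}\n\n${r.body}`).join('\n\n'),
})

for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}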
Consider Alternatives When
Explicit chain-of-thought needed:
Hard math or coding problems are better served by Qwen3-Next-80B-A3B-Thinking
Short-context exchanges:
The architectural advantages for long sequences don't apply to short prompts
Multimodal input required:
This model processes text only; use a vision-language model for images or video
Narrow-task accuracy ceiling:
Maximum benchmark accuracy on a specific task can outweigh throughput efficiency gains
Conclusion
Qwen3 Next 80B A3B Instruct fits production deployments that combine long context with high throughput requirements. Its hybrid Transformer-Mamba architecture delivers the efficiency of linear attention without surrendering the precision of the full-attention layers it retains, making it practical for document-scale workloads that would otherwise require expensive dense models.
FAQ
"80B" refers to 80 billion total parameters in the MoE pool; "A3B" indicates that approximately 3 billion parameters are activated per token. Only 10 of 512 experts fire for each token, giving the model a 3.75% activation ratio.
How does the hybrid architecture enable such a long context window?
The architecture alternates Gated DeltaNet (linear attention) layers with Gated Attention layers. Linear attention scales sub-quadratically with sequence length, enabling the model to process sequences of 262.1K tokens with significantly lower compute than a fully quadratic attention model.
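To make that scaling difference concrete, the back-of-the-envelope comparison below is illustrative only; it ignores constants and per-layer details and is not a real cost model.

// Relative growth of quadratic (full) vs. linear attention cost, normalized to 32K tokens.
const baseline = 32_768
for (const n of [32_768, 131_072, 262_144]) {
  const quadraticGrowth = (n / baseline) ** 2 // full attention: cost ~ n^2
  const linearGrowth = n / baseline           // linear attention: cost ~ n
  console.log(`${n.toLocaleString()} tokens -> full attention ~${quadraticGrowth}x, linear attention ~${linearGrowth}x`)
}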
How much faster is it than a comparable dense model?
On sequences of 32K tokens or longer, Qwen3 Next 80B A3B Instruct achieves approximately 10x higher throughput than the dense Qwen3-32B model, according to the published technical specifications.
Does the Instruct variant expose reasoning traces?
No. The Instruct variant is optimized for direct instruction following without thinking traces; the Qwen3-Next-80B-A3B-Thinking variant provides a reasoning mode.
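Switching to the Thinking variant is a one-line change in the request. The model slug below is an assumption; confirm the exact id in your gateway's model catalog.

import { generateText } from 'ai'

// Point the same call at the Thinking variant when explicit reasoning is required.
const { text } = await generateText({
  model: 'alibaba/qwen3-next-80b-a3b-thinking', // assumed slug; verify before use
  prompt: 'Prove that the sum of two odd integers is even.',
})
console.log(text)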
How large is the context window, and can it be extended?
The native context is 262.1K tokens. With YaRN RoPE scaling it can be extended to approximately one million tokens, and the model achieves 80.3% accuracy on the RULER extreme-context benchmark at 1M tokens.
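The arithmetic behind that figure, as a quick sketch; the scaling factor of 4 is implied by the native and extended figures above, so check the model card for the exact value recommended for your serving stack.

// Native window of 262,144 tokens times a YaRN factor of ~4 gives ~1,048,576 tokens.
const nativeContext = 262_144
const yarnFactor = 4 // implied by the ~1M figure; confirm against the model card
console.log(nativeContext * yarnFactor) // 1,048,576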
How does it score on standard benchmarks?
Key scores include MMLU-Pro (80.6), MMLU-Redux (90.9), GPQA (72.9), AIME25 (69.5), LiveCodeBench (56.6), Arena-Hard v2 (82.7), and BFCL-v3 (70.3) for function calling.
Does it use Multi-Token Prediction?
Yes. Multi-Token Prediction further accelerates inference beyond the baseline throughput gains from the sparse MoE architecture.