Qwen3 Next 80B A3B Instruct
Qwen3 Next 80B A3B Instruct is an 80-billion-parameter hybrid Transformer-Mamba model that activates only about 3 billion parameters per token, delivering roughly 10x the inference throughput of comparable dense models on long inputs while supporting a native context window of 262,144 tokens.
import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen3-next-80b-a3b-instruct',
  prompt: 'Why is the sky blue?',
})
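A minimal sketch of consuming the response, assuming the AI SDK's textStream async iterable and a Node.js runtime:

for await (const chunk of result.textStream) {
  // Write each streamed text chunk to stdout as it arrives.
  process.stdout.write(chunk)
}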
What does "80B-A3B" mean in this model's name?
"80B" refers to 80 billion total parameters in the MoE pool; "A3B" indicates that approximately 3 billion parameters are activated per token. Only 10 of 512 experts fire for each token, giving the model a 3.75% activation ratio.
How does Hybrid Transformer-Mamba architecture affect performance on long contexts?
The architecture interleaves Gated DeltaNet layers (a linear-attention mechanism) with Gated Attention layers. Because linear attention scales sub-quadratically with sequence length, the model can process sequences of 262,144 tokens with significantly lower compute than a fully quadratic attention stack.
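The scaling argument can be made concrete with a toy cost model; the big-O forms below (quadratic for full attention, linear for linear attention) are standard, but the absolute numbers are only illustrative:

// Toy per-layer cost models as a function of sequence length n (arbitrary units).
const fullAttentionCost = (n: number) => n * n // standard attention: O(n^2)
const linearAttentionCost = (n: number) => n   // linear attention (e.g. Gated DeltaNet): O(n)

for (const n of [4_096, 32_768, 262_144]) {
  const ratio = fullAttentionCost(n) / linearAttentionCost(n)
  console.log(`n = ${n}: quadratic attention costs ${ratio.toLocaleString()}x more than linear`)
}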
What is the throughput advantage over a dense model?
On sequences of 32K tokens or longer, Qwen3 Next 80B A3B Instruct achieves roughly 10x the throughput of the dense Qwen3-32B model, according to its published technical specifications.
Does this model support a thinking or reasoning mode?
No. The Instruct variant is optimized for direct instruction following and does not emit thinking traces. For reasoning mode, use the separate Qwen3-Next-80B-A3B-Thinking variant, as shown below.
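If you do want reasoning traces, the same streamText call can target the Thinking variant instead; the model slug below simply mirrors the Instruct slug used earlier and should be checked against your provider's catalog:

import { streamText } from 'ai'

// Assumed slug for the Thinking variant; verify the exact identifier with your provider.
const result = streamText({
  model: 'alibaba/qwen3-next-80b-a3b-thinking',
  prompt: 'How many primes are there below 100?',
})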
What is the maximum context length?
The native context window is 262,144 tokens. With YaRN RoPE scaling, this can be extended to approximately one million tokens, and the model achieves 80.3% accuracy on the 1M-token RULER extreme-context benchmark.
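For a rough sense of the YaRN factor implied by those two numbers (assuming the extension factor is simply the target length divided by the native length):

const nativeContext = 262_144   // native context window in tokens
const targetContext = 1_000_000 // approximate extended context with YaRN
console.log(`implied scaling factor ≈ ${(targetContext / nativeContext).toFixed(2)}`) // ≈ 3.81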
What benchmarks has this model been evaluated on?
Key scores include MMLU-Pro (80.6), MMLU-Redux (90.9), GPQA (72.9), AIME25 (69.5), LiveCodeBench (56.6), Arena-Hard v2 (82.7), and BFCL-v3 (70.3) for function calling.
Is Multi-Token Prediction supported?
Yes. The model includes a native Multi-Token Prediction (MTP) mechanism that can be used for speculative decoding, further accelerating inference beyond the baseline throughput gains from the sparse MoE architecture.