Qwen3 Next 80B A3B Instruct introduces a hybrid attention architecture that alternates between Gated DeltaNet (a linear attention mechanism) and standard Gated Attention within a 48-layer, 512-expert MoE stack. The 48 layers follow a four-layer block repeated 12 times: three Gated DeltaNet + MoE layers followed by one Gated Attention + MoE layer. This design is purpose-built for ultra-long-context efficiency: linear attention handles three quarters of the layers at sub-quadratic cost, while the Gated Attention layers, interleaved every fourth layer, maintain the precision needed for complex cross-token reasoning.
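As a rough illustration of that layout (the module names and the construction are hypothetical placeholders, not Qwen3-Next's actual code), the 3:1 interleaving can be sketched as:

```python
# Sketch of the hybrid layer layout: a 4-layer block repeated 12 times.
NUM_LAYERS = 48
BLOCK = ["gated_deltanet"] * 3 + ["gated_attention"]  # 3 linear-attention layers, then 1 full-attention layer

layout = [BLOCK[i % len(BLOCK)] for i in range(NUM_LAYERS)]

assert layout.count("gated_deltanet") == 36   # 75% of layers use linear attention
assert layout.count("gated_attention") == 12  # 25% use standard gated attention
```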
With only 10 of 512 routed experts activated per token (plus one shared expert), Qwen3 Next 80B A3B Instruct keeps roughly 3B of its 80B parameters active per step, an activation ratio of about 3.75%. Combined with Multi-Token Prediction during inference, this translates to roughly 10x the throughput of comparable 32B dense models on sequences of 32K tokens or longer, a meaningful operational advantage for workloads that process long documents or transcripts at scale. The Instruct variant is tuned for direct instruction following and does not generate thinking traces; that use case belongs to the separate Qwen3-Next-80B-A3B-Thinking model.
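A minimal sketch of that routing pattern, assuming a standard top-k softmax router (the hidden size, batch shape, and module layout are illustrative assumptions, not the model's actual configuration):

```python
import torch

num_experts, top_k, hidden = 512, 10, 2048   # hidden size is illustrative

router = torch.nn.Linear(hidden, num_experts, bias=False)
x = torch.randn(4, hidden)                   # a batch of 4 token representations

probs = router(x).softmax(dim=-1)            # [4, 512] routing probabilities
weights, expert_ids = torch.topk(probs, top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the selected 10

# Each token is processed by its 10 selected routed experts plus the always-on
# shared expert; the remaining 502 routed experts are skipped entirely.
print(expert_ids.shape)   # torch.Size([4, 10])
```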
On the RULER benchmark at a 1M-token context length, Qwen3 Next 80B A3B Instruct scores 80.3% accuracy; its native context window of 262,144 tokens is extensible to roughly one million tokens via YaRN RoPE scaling. On knowledge and alignment benchmarks, it scores 80.6 on MMLU-Pro and 82.7 on Arena-Hard v2, tracking competitively with models that require far more compute per token.
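For the YaRN extension, a configuration sketch along the lines of the Hugging Face transformers rope_scaling mechanism might look as follows (the checkpoint name, scaling factor, and loading arguments are assumptions for illustration, not a verified recipe):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed checkpoint name; adjust to the repository you are actually loading.
model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                             # 262,144 x 4 ≈ 1.05M positions
    "original_max_position_embeddings": 262144,
}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```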