Qwen3 Next 80B A3B Thinking

Qwen3 Next 80B A3B Thinking is a hybrid Transformer-Mamba reasoning model with 80 billion total parameters (3B active per token) and a dedicated thinking mode. It achieves strong results on AIME25 and supports a native context of 262.1K tokens.

index.ts
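import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen3-next-80b-a3b-thinking',
  prompt: 'Why is the sky blue?',
})

// Print text deltas as they arrive.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}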

Playground

Try out Qwen3 Next 80B A3B Thinking by Alibaba. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Legal | Context | Latency | Throughput | Input | Output | Release Date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Alibaba | Terms, Privacy | 131K | 0.6s | 285tps | $0.15/M | $1.20/M | 09/12/2025 |
| Novita AI | Terms, Privacy | 66K | 0.9s | 436tps | $0.15/M | $1.50/M | 09/12/2025 |
| Google Vertex AI | Terms, Privacy | 262K | 0.3s | 165tps | $0.15/M | $1.20/M | 09/12/2025 |

Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Alibaba

| Model | Context | Latency | Throughput | Input | Output | Cache Read | Cache Write | Providers | Release Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 240K | 3.5s | 45tps | $1.30/M | $7.80/M | $0.26/M | $1.63/M | alibaba | 04/20/2026 |
|  | 1M | 0.2s | 99tps | $0.50/M | $3.00/M | $0.1/M | $0.63/M | alibaba, fireworks | 04/02/2026 |
|  | 1M | 1.8s | 82tps | $0.10/M | $0.40/M | $0.0/M | $0.13/M | alibaba | 02/24/2026 |
|  | 1M | 1.3s | 110tps | $0.40/M | $2.40/M | $0.04/M | $0.5/M | alibaba | 02/16/2026 |
|  | 256K | 2.0s | 44tps | $0.50/M | $1.20/M |  |  | bedrock, togetherai | 07/22/2025 |
|  | 262K | 0.1s | 1039tps | $0.07/M | $0.46/M | $0.6/M |  | cerebras, deepinfra, novita, +1 | 04/01/2025 |

About Qwen3 Next 80B A3B Thinking

Qwen3 Next 80B A3B Thinking is the reasoning-mode counterpart to Qwen3-Next-80B-A3B-Instruct. It shares the same Hybrid Transformer-Mamba architecture: 48 layers arranged as 12 blocks, each with three Gated DeltaNet + MoE layers followed by one Gated Attention + MoE layer, and 512 total experts of which only 10 are activated per token. What distinguishes the Thinking variant is that thinking mode is the only mode. The model always generates a <think> reasoning trace before its final answer, and the recommended token budget for that trace ranges from 32,768 tokens for typical queries to 81,920 tokens for difficult mathematical or coding problems.
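
The layer pattern described above can be checked with a few lines of arithmetic. The sketch below is only an illustration of the stated counts (12 blocks, three Gated DeltaNet layers plus one Gated Attention layer per block, 10 of 512 experts active); the type and function names are hypothetical and do not come from any Qwen library.

```typescript
// Layer kinds named in the description; the identifiers are illustrative only.
type LayerKind = 'gated-deltanet+moe' | 'gated-attention+moe'

// Build the layout: `blocks` repetitions of `deltaNetPerBlock` linear-attention
// layers followed by one full-attention layer.
function buildLayerLayout(blocks = 12, deltaNetPerBlock = 3): LayerKind[] {
  const kinds: LayerKind[] = []
  for (let b = 0; b < blocks; b++) {
    for (let i = 0; i < deltaNetPerBlock; i++) kinds.push('gated-deltanet+moe')
    kinds.push('gated-attention+moe')
  }
  return kinds
}

const qwenLayout = buildLayerLayout()
const attentionLayerCount = qwenLayout.filter((k) => k === 'gated-attention+moe').length
const deltaNetLayerCount = qwenLayout.length - attentionLayerCount

console.log(qwenLayout.length, deltaNetLayerCount, attentionLayerCount) // 48 36 12

// Fraction of experts active per token, using the 10-of-512 figure.
console.log(10 / 512) // 0.01953125
```

So three quarters of the layers use the linear-attention path, and under 2% of the experts run for any given token.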

This exclusive thinking mode is a deliberate design choice. By eliminating mode switching, the model is specialized for tasks where getting the right answer matters more than minimizing output length. The architecture's linear-attention Gated DeltaNet layers keep context processing efficient even as reasoning traces extend the total sequence length substantially beyond the prompt.
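
Because the reasoning trace is generated into the same sequence as the prompt, it is worth checking how much room a prompt has once a thinking budget is reserved. The following is a rough sketch, assuming that "262.1K" corresponds to 262,144 tokens and that prompt, trace, and answer all share one window; `promptHeadroom`, the 4,096-token answer allowance, and the 131,072-token provider window are hypothetical values chosen for illustration.

```typescript
// Assumption: the 262.1K native context corresponds to 262,144 tokens (2^18).
const NATIVE_CONTEXT_TOKENS = 262144

// Tokens left for the prompt after reserving the thinking trace and final answer.
function promptHeadroom(
  thinkingBudget: number,
  answerTokens: number,
  contextWindow: number = NATIVE_CONTEXT_TOKENS,
): number {
  return contextWindow - thinkingBudget - answerTokens
}

const ANSWER_ALLOWANCE = 4096 // hypothetical allowance for the final answer

console.log(promptHeadroom(32768, ANSWER_ALLOWANCE)) // 225280
console.log(promptHeadroom(81920, ANSWER_ALLOWANCE)) // 176128

// A provider that exposes a smaller window leaves far less room (hypothetical size).
console.log(promptHeadroom(81920, ANSWER_ALLOWANCE, 131072)) // 45056
```

A negative result would mean the chosen budget cannot fit alongside any prompt on that provider.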

Benchmark results reflect this specialization. Across math and coding benchmarks the model outperforms both the Qwen3-30B-A3B-Thinking-2507 and Qwen3-32B-Thinking predecessors, as well as several proprietary reasoning models in Qwen's published comparisons. See https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-next-80b-a3b-thinking for detailed benchmark tables.

What To Consider When Choosing a Provider

  • Configuration: Because thinking-mode responses can exceed 32K output tokens for complex reasoning tasks, verify that your provider and application timeout settings accommodate extended generation times before deploying.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
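
To put the timeout advice in numbers, a generation time can be estimated from a token budget and the throughput and latency figures in the provider table. This is a back-of-the-envelope sketch, not a gateway feature: `estimateTimeoutMs` and the safety factor of 2 are assumptions, and P50 figures describe typical rather than worst-case traffic.

```typescript
interface TimeoutInputs {
  outputTokens: number // expected thinking trace plus answer, in tokens
  tokensPerSecond: number // provider throughput (P50 from the provider table)
  ttftSeconds: number // time to first token
  safetyFactor: number // headroom multiplier; 2 is an arbitrary choice
}

// Estimated wall-clock time for one response, padded by the safety factor.
function estimateTimeoutMs(t: TimeoutInputs): number {
  const seconds = t.ttftSeconds + t.outputTokens / t.tokensPerSecond
  return Math.ceil(seconds * t.safetyFactor * 1000)
}

// 81,920-token budget at 165 tps with 0.3 s to first token.
const hardProblemTimeoutMs = estimateTimeoutMs({
  outputTokens: 81920,
  tokensPerSecond: 165,
  ttftSeconds: 0.3,
  safetyFactor: 2,
})

console.log(hardProblemTimeoutMs) // 993570 (about 16.6 minutes)
```

A value like this could then be handed to `AbortSignal.timeout(ms)` and passed as the request's abort signal, assuming your AI SDK version accepts an `abortSignal` option; any proxy or serverless function limits in front of the request would need to allow at least as long.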

When to Use Qwen3 Next 80B A3B Thinking

Best For

  • Competitive mathematics and science: Rigorous reasoning problems where step-by-step derivation is required
  • Hard coding challenges: Competitive programming and algorithmic design that benefit from explicit problem decomposition before code generation
  • Long-document cross-referencing: Tasks that reason across 100K+ token inputs while maintaining structured thought
  • Tutoring and explanation systems: Applications where visible reasoning chains are pedagogically valuable
  • Auditable research workflows: Use cases where a transparent inference process allows human review of the model's logic

Consider Alternatives When

  • High-throughput instruction following: Use Qwen3-Next-80B-A3B-Instruct for short-to-medium tasks without reasoning overhead
  • Strict token budgets: Thinking traces add significant output volume and cost per request
  • Multimodal input required: This model is text-only; use a vision-language variant for images or video
  • Real-time latency requirements: Extended reasoning generation can't meet hard low-latency response targets
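
The token-budget point can be made concrete with the listed rates of $0.15 per million input tokens and $1.20 per million output tokens. The sketch below assumes thinking tokens are billed as ordinary output tokens; the request sizes (a 2,000-token prompt, a 500-token answer) and the function name are hypothetical.

```typescript
interface Pricing {
  inputPerMillionUsd: number
  outputPerMillionUsd: number
}

// Cost of one request in USD, given token counts and per-million-token rates.
function estimateCostUsd(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens * p.inputPerMillionUsd + outputTokens * p.outputPerMillionUsd) / 1e6
}

// Rates taken from the provider table; request sizes are made up.
const listedRates: Pricing = { inputPerMillionUsd: 0.15, outputPerMillionUsd: 1.2 }

const answerOnlyCost = estimateCostUsd(2000, 500, listedRates)
const withFullTraceCost = estimateCostUsd(2000, 32768 + 500, listedRates)

console.log(answerOnlyCost.toFixed(4)) // "0.0009"
console.log(withFullTraceCost.toFixed(4)) // "0.0402"
```

In this example a trace that uses the full 32,768-token budget makes the request roughly 45 times more expensive than the answer alone, which is why short tasks are better served by the Instruct variant.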

Conclusion

Qwen3 Next 80B A3B Thinking occupies a distinct space: it pairs an architecture built for long-context efficiency with a model dedicated exclusively to extended reasoning. Teams working on hard STEM problems, detailed code analysis, or any domain where a visible reasoning chain adds quality and auditability can use it without resorting to a fully dense trillion-parameter alternative.

Frequently Asked Questions

  • Why does this model only support thinking mode, not a standard non-thinking mode?

    This variant is specialized for complex reasoning. By committing entirely to thinking mode, it avoids the quality compromises that come from training a single model to switch between reasoning and direct-answer behaviors.

  • How long can the thinking trace be?

    The recommended budget is 32,768 tokens for typical queries and up to 81,920 tokens for complex mathematics or coding problems. These are recommendations; actual trace length is determined by the model based on problem complexity.

  • How does the AIME25 score compare to other models in the family?

    The Thinking variant outperforms the Instruct variant's 69.5% on AIME25, and also surpasses Qwen3-30B-A3B-Thinking-2507 and several proprietary reasoning models in Qwen's published comparisons on this benchmark. See https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-next-80b-a3b-thinking for specific scores.

  • Does the Hybrid Transformer-Mamba architecture help during reasoning?

    Yes. The linear-attention Gated DeltaNet layers let the model handle sequences that grow long during reasoning (the prompt plus the extended thinking trace) at sub-quadratic cost, unlike full attention. This keeps generation efficient even for hard problems that trigger long traces.

  • What is the native context length?

    The native context is 262.1K tokens, extensible to approximately one million tokens via YaRN RoPE scaling. This allows the model to reason over very long input documents alongside its own thinking trace.

  • How should I parse the thinking content from responses?

    The model outputs reasoning between <think> and </think> before the final answer. If the opening tag is missing, find the closing </think> token (see Qwen reference parsers) and split there into thinking content and final response.

  • How does this model compare to Qwen3-Max-Thinking for reasoning tasks?

    Both models support extended reasoning, but they represent different architectural tradeoffs. Qwen3 Next 80B A3B Thinking uses a sparse hybrid architecture optimized for throughput on long sequences; Qwen3-Max-Thinking uses a trillion-parameter model with autonomous tool invocation. The right choice depends on whether autonomous search/code-execution or architecture-driven efficiency is more valuable for your workload.
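
The parsing rule from the FAQ above (split at the closing </think> tag, tolerating a missing opening tag) can be sketched on plain strings as follows. This is a minimal illustration rather than Qwen's reference parser: `splitThinking` is a hypothetical helper, and treating output with no closing tag as answer text is an assumption you may want to change, since a trace truncated by the output limit would also land in that branch.

```typescript
interface ParsedResponse {
  thinking: string
  answer: string
}

const THINK_OPEN = '<think>'
const THINK_CLOSE = '</think>'

// Split raw model output into the reasoning trace and the final answer.
function splitThinking(raw: string): ParsedResponse {
  const closeIdx = raw.indexOf(THINK_CLOSE)
  if (closeIdx === -1) {
    // Assumption: no closing tag means there is no separable trace.
    return { thinking: '', answer: raw.trim() }
  }
  let thinking = raw.slice(0, closeIdx)
  // The opening tag may be absent; strip it only when present.
  const openIdx = thinking.indexOf(THINK_OPEN)
  if (openIdx !== -1) thinking = thinking.slice(openIdx + THINK_OPEN.length)
  return {
    thinking: thinking.trim(),
    answer: raw.slice(closeIdx + THINK_CLOSE.length).trim(),
  }
}

const parsedExample = splitThinking('Rayleigh scattering favors short wavelengths.</think>\n\nBlue light scatters most.')
console.log(parsedExample.thinking) // Rayleigh scattering favors short wavelengths.
console.log(parsedExample.answer) // Blue light scatters most.
```

If your client already returns reasoning and answer text as separate fields, use those fields instead of parsing tags by hand.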