Qwen3 235B A22B Thinking 2507
Qwen3 235B A22B Thinking 2507 is Alibaba's 235B MoE model configured for extended chain-of-thought reasoning, combining 235 billion total parameters with always-on deliberative reasoning for demanding inference tasks.
import { streamText } from 'ai'
const result = streamText({ model: 'alibaba/qwen3-235b-a22b-thinking', prompt: 'Why is the sky blue?' })
Playground
Try out Qwen3 235B A22B Thinking 2507 by Alibaba. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
Metrics reported per provider (visit the docs for more info):
- P50 throughput on live AI Gateway traffic, in tokens per second (TPS).
- P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds.
- Direct request success rate on AI Gateway and per provider.
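Setting a provider preference, as described in this section, might look like the sketch below. It only builds the options object; the `gateway.order` shape is an assumption about how the AI SDK passes preferences to AI Gateway, and `preferProviders` is a hypothetical helper, so confirm both against the docs before relying on them.

```typescript
// Sketch: build per-request options that express a provider preference.
// ASSUMPTION: the gateway reads preferences from `providerOptions.gateway.order`.
type GatewayOptions = { gateway: { order: string[] } };

// Hypothetical helper: normalizes slugs and keeps the caller's priority order.
function preferProviders(slugs: string[]): GatewayOptions {
  const seen = new Set<string>();
  const order: string[] = [];
  for (const raw of slugs) {
    const slug = raw.trim().toLowerCase();
    // Skip blanks and duplicates so each provider appears once.
    if (slug !== '' && !seen.has(slug)) {
      seen.add(slug);
      order.push(slug);
    }
  }
  return { gateway: { order } };
}

// Provider slugs taken from this page's provider list.
const providerOptions = preferProviders(['novita', 'deepinfra', 'novita']);

// Intended use (not executed here):
//   streamText({ model: 'alibaba/qwen3-235b-a22b-thinking', prompt, providerOptions })
```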
About Qwen3 235B A22B Thinking 2507
Qwen3 235B A22B Thinking 2507 is the Qwen3-235B-A22B configured with thinking mode as the default. The base model can switch between extended reasoning and direct response per request. This variant targets applications that need deliberate, chain-of-thought processing on every query.
The underlying architecture is the same 235B MoE: 235 billion total parameters with 22 billion activated per inference step. That MoE structure makes thinking mode tractable at this scale. Because only 22 billion parameters activate per token, Qwen3 235B A22B Thinking 2507 sustains long reasoning traces without the serving costs of a fully dense 235B model generating the same sequence length.
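The arithmetic behind that claim can be sketched as follows. This is a rough back-of-the-envelope model, assuming per-token compute scales with the number of activated parameters; it ignores attention cost, expert routing overhead, and the fact that all 235B parameters still have to be held in memory. The function names are illustrative only.

```typescript
// Parameter counts from the text above, in billions.
const TOTAL_PARAMS_B = 235;
const ACTIVE_PARAMS_B = 22;

// Fraction of the model's parameters that participate in each token.
function activeFraction(activeB: number, totalB: number): number {
  return activeB / totalB;
}

// ASSUMPTION: compute per generated token is proportional to activated parameters.
// Returns how many times more parameter-activations a dense model of the same
// total size would spend on a trace of the same length.
function denseToMoeRatio(activeB: number, totalB: number): number {
  return totalB / activeB;
}

const fraction = activeFraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B); // about 0.094
const ratio = denseToMoeRatio(ACTIVE_PARAMS_B, TOTAL_PARAMS_B);   // about 10.7
```

Under this simplification, roughly 9.4% of the parameters are active per token, so a long reasoning trace costs on the order of a tenth of the parameter-activations of an equally sized dense model.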
The chain-of-thought behavior was explicitly trained and optimized, not simply prompted. Alibaba's research indicates that the model demonstrates "scalable and smooth performance improvements that are directly correlated with the computational reasoning budget allocated." In other words, giving the model a larger reasoning budget yields measurable gains on hard problems.
For the hardest categories of tasks (competitive mathematics, multi-hop logical reasoning, complex code debugging, and structured scientific analysis), this thinking-configured variant makes fuller use of the 235B parameter capacity. Benchmark results for the underlying model are competitive with other strong reasoning models on reasoning-heavy evaluations.
What To Consider When Choosing a Provider
- Configuration: Provider selection may affect time-to-first-token for reasoning models, since longer thinking traces amplify any latency differences between providers.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
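The authentication point above can be illustrated with a small credential resolver. This is a sketch under assumptions: the environment variable names `AI_GATEWAY_API_KEY` and `VERCEL_OIDC_TOKEN`, and the order in which they are checked, are not stated on this page and should be verified against the AI Gateway docs. `resolveGatewayCredential` is a hypothetical helper, not part of any SDK.

```typescript
type Credential = { kind: 'api-key' | 'oidc'; token: string };

// Hypothetical helper: picks one gateway credential from an environment map.
// ASSUMPTION: variable names and API-key-first precedence are illustrative.
function resolveGatewayCredential(
  env: Record<string, string | undefined>,
): Credential {
  const apiKey = env['AI_GATEWAY_API_KEY'];
  if (apiKey) return { kind: 'api-key', token: apiKey };

  const oidc = env['VERCEL_OIDC_TOKEN'];
  if (oidc) return { kind: 'oidc', token: oidc };

  // No provider-specific credentials are consulted: the gateway handles those.
  throw new Error('No AI Gateway credential found');
}

// Example with a fake environment; real code would pass process.env.
const credential = resolveGatewayCredential({ VERCEL_OIDC_TOKEN: 'example-token' });
```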
When to Use Qwen3 235B A22B Thinking 2507
Best For
- Mathematical problem solving requiring detailed derivation: When answers need to show work, such as proofs, step-by-step calculations, or theorem verification, the always-on thinking mode ensures the model reasons carefully before committing to an answer
- Complex debugging and code analysis: Tracing through multi-file codebases, identifying subtle bugs, or reasoning about race conditions and edge cases benefits from extended deliberation rather than pattern-matched output
- Structured decision-support tasks: Applications in legal analysis, medical information synthesis, or financial modeling that require the model to consider multiple factors and surface its reasoning process explicitly
- Difficult multi-hop question answering: Tasks where the final answer requires correctly executing a chain of dependent reasoning steps are where thinking models show the largest quality gains over non-thinking alternatives
- Research assistance requiring transparent reasoning: When users need to audit or follow the model's reasoning process, the thinking trace provides visibility into how conclusions were reached
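For the auditing use case in the last item, an application will usually want to keep the thinking trace separate from the final answer. The sketch below assumes the response has already been reduced to a list of parts tagged as reasoning or answer; the `Part` type is hypothetical, and the actual stream part names depend on the SDK version in use.

```typescript
// Hypothetical, simplified representation of streamed output parts.
type Part = { kind: 'reasoning' | 'answer'; text: string };

// Concatenates reasoning and answer text into two separate strings so the
// trace can be logged or shown for audit without mixing into the answer.
function splitTrace(parts: Part[]): { reasoning: string; answer: string } {
  let reasoning = '';
  let answer = '';
  for (const part of parts) {
    if (part.kind === 'reasoning') {
      reasoning += part.text;
    } else {
      answer += part.text;
    }
  }
  return { reasoning, answer };
}

const split = splitTrace([
  { kind: 'reasoning', text: 'Shorter wavelengths scatter more. ' },
  { kind: 'reasoning', text: 'Blue is short-wavelength.' },
  { kind: 'answer', text: 'Rayleigh scattering favors blue light.' },
]);
```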
Consider Alternatives When
- Response latency is critical: Thinking mode generates substantial internal tokens before producing the final answer. For real-time conversational interfaces or latency-sensitive pipelines, the non-thinking variant or a smaller model will respond much faster
- Most queries are simple and don't require deliberation: Using a thinking model for routine tasks (formatting, translation, simple extraction) pays the latency and token cost of reasoning without meaningful quality benefit. The base Qwen3-235B-A22B model with thinking disabled is more appropriate for mixed workloads
- Budget constraints are strict: Thinking traces add tokens to every response. If your application is cost-constrained, evaluate whether the quality improvement on your specific task distribution justifies the additional token usage
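To evaluate the budget trade-off above, it can help to estimate how much of a request's cost comes from the thinking trace. The sketch below assumes trace tokens are billed at the output rate; the prices and token counts are placeholders, not real rates for this model.

```typescript
interface Usage {
  inputTokens: number;
  reasoningTokens: number; // thinking trace
  answerTokens: number;    // final visible answer
}

interface Pricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// ASSUMPTION: reasoning tokens are billed like any other output token.
function estimateCost(usage: Usage, pricing: Pricing) {
  const inputCost = (usage.inputTokens / 1_000_000) * pricing.inputPerMillionUsd;
  const reasoningCost = (usage.reasoningTokens / 1_000_000) * pricing.outputPerMillionUsd;
  const answerCost = (usage.answerTokens / 1_000_000) * pricing.outputPerMillionUsd;
  const totalUsd = inputCost + reasoningCost + answerCost;
  // Share of the bill attributable to the trace (0 when nothing was billed).
  const reasoningShare = totalUsd === 0 ? 0 : reasoningCost / totalUsd;
  return { totalUsd, reasoningShare };
}

// Placeholder numbers for illustration only.
const estimate = estimateCost(
  { inputTokens: 1_000, reasoningTokens: 8_000, answerTokens: 500 },
  { inputPerMillionUsd: 0.5, outputPerMillionUsd: 2.0 },
);
```

With these placeholder values the trace accounts for roughly 91% of a total of about $0.0175, which is the kind of ratio worth checking against the quality gain on your own task distribution.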
Conclusion
Qwen3 235B A22B Thinking 2507 is built for the class of tasks where getting the right answer justifies spending more tokens on reasoning. The MoE architecture makes it more economical to sustain long thinking traces than a dense model of comparable total scale, and the reasoning capability is built into the model rather than being a prompting trick. AI Gateway wraps the model with a unified API surface and automated failover across novita, deepinfra, and alibaba.
Frequently Asked Questions
How does this model differ from the standard Qwen3-235B-A22B listing?
This variant is specifically configured for thinking mode: extended chain-of-thought reasoning is the default behavior rather than something toggled per request. It's intended for workloads where deliberative reasoning is always desired, rather than mixed applications that need to switch modes.
Does the thinking trace count toward the context window and output token limit?
Yes. The reasoning trace is generated within the model's context and contributes to token usage. Thinking sequences on complex problems can be long, so setting an appropriate thinking budget prevents runaway token consumption. Output pricing generally applies to all generated tokens, including the trace, though the details depend on the provider's implementation.
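A simple pre-flight check can make this accounting concrete. The sketch below assumes the trace and the final answer share one output limit, and that prompt plus output must fit in the context window; the field names and the numbers in the example are placeholders rather than this model's actual limits.

```typescript
interface TokenPlan {
  contextWindow: number;   // total tokens the model can hold (placeholder value below)
  promptTokens: number;    // tokens already used by the input
  maxOutputTokens: number; // cap on all generated tokens, trace included
  thinkingBudget: number;  // portion of the output reserved for the trace
}

// Returns how many output tokens remain for the final answer and whether
// the whole request fits inside the context window.
function planTokenBudget(plan: TokenPlan): { answerBudget: number; fits: boolean } {
  const answerBudget = plan.maxOutputTokens - plan.thinkingBudget;
  const fits =
    answerBudget > 0 &&
    plan.promptTokens + plan.maxOutputTokens <= plan.contextWindow;
  return { answerBudget, fits };
}

const budget = planTokenBudget({
  contextWindow: 100_000,
  promptTokens: 4_000,
  maxOutputTokens: 20_000,
  thinkingBudget: 16_000,
});
```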
Why is the MoE architecture particularly useful for thinking mode?
Thinking mode generates long internal token sequences before producing the final answer. With a fully dense model, every one of those tokens would activate all parameters. The MoE design activates only 22B of 235B parameters per token, making the extended reasoning trace significantly cheaper to generate than it would be with a dense model of equivalent total capacity.
What benchmarks has the underlying model been evaluated on?
The Qwen3-235B-A22B model was benchmarked against other strong reasoning models on coding, mathematics, and general reasoning tasks, with competitive results reported. See the Qwen3 blog at https://novita.ai/models/model-detail/qwen-qwen3-vl-235b-a22b-thinking for detailed benchmark tables.
Can thinking mode be adjusted or turned off for specific requests?
The thinking budget can be configured per request. If you occasionally need a faster response, reducing the thinking budget will constrain the reasoning phase. Completely disabling thinking on this variant may not reflect its intended use case; the standard Qwen3-235B-A22B model is better suited for workloads that need to toggle thinking on and off.
What languages does this model support for reasoning tasks?
The model covers 119 languages and dialects. Thinking-mode reasoning works across this multilingual coverage, though the highest benchmark performance data tends to come from English and Chinese evaluations.
Can I configure Qwen3 235B A22B Thinking 2507 for both thinking and direct-response traffic from one integration?
This variant is configured with thinking mode as the default. For mixed workloads that need to toggle thinking on and off per request, the standard Qwen3-235B-A22B listing exposes both modes.
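For a mixed workload, one option is to route each request to a model ID at the application layer. The sketch below is one possible shape, not a gateway feature: `pickModel` and the `Task` flags are hypothetical, and the standard listing's slug is left as a caller-supplied placeholder because it does not appear on this page.

```typescript
interface Task {
  needsDeliberation: boolean; // e.g. proofs, multi-hop questions, hard debugging
  latencySensitive: boolean;  // e.g. real-time chat
}

interface ModelIds {
  thinking: string; // this page's model
  standard: string; // slug of the standard listing, supplied by the caller
}

// Hypothetical router: use the thinking variant only when the task benefits
// from deliberation and can tolerate the added latency.
function pickModel(task: Task, models: ModelIds): string {
  if (task.needsDeliberation && !task.latencySensitive) {
    return models.thinking;
  }
  return models.standard;
}

const models: ModelIds = {
  thinking: 'alibaba/qwen3-235b-a22b-thinking',
  standard: '<standard-listing-slug>', // placeholder, not a real slug
};

const chosen = pickModel({ needsDeliberation: true, latencySensitive: false }, models);
```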