
Qwen3-30B-A3B

Qwen3-30B-A3B is a mixture-of-experts model from Alibaba that activates only 3 billion of its 30 billion parameters per inference, outperforming QwQ-32B while running at a fraction of the compute cost.

Reasoning · Tool Use
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen-3-30b',
  prompt: 'Why is the sky blue?'
})

Playground

Try out Qwen3-30B-A3B by Alibaba. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Qwen3-30B-A3B

Qwen3-30B-A3B occupies a distinctive position in the Qwen3 lineup: it's the smaller of two MoE models in the family, but its efficiency story is the more striking one. Inference activates only 3 billion parameters, comparable to serving a small model, yet the full 30 billion parameter capacity gives it a much larger representational space than a genuinely 3B model would have.

Alibaba's benchmarks position this model above QwQ-32B, which was previously one of the stronger open reasoning models. QwQ-32B is a dense model that activates all 32 billion of its parameters on every token, meaning Qwen3-30B-A3B achieves superior results at roughly one-tenth the active parameter count. For teams running inference at volume, this ratio has direct cost implications.

Like the rest of the Qwen3 family, the model supports hybrid thinking modes. The enable_thinking parameter switches between step-by-step chain-of-thought reasoning and direct-response mode. The thinking budget can be configured per request, so applications can use extended reasoning for genuinely complex queries while defaulting to fast responses for routine ones.
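As a sketch, the toggle can be expressed as a small helper that builds per-request reasoning options. The `enable_thinking` name comes from the model card; the `thinking_budget` key and how such options are passed through a given provider or SDK are assumptions and may differ in practice:

```typescript
// Sketch: build per-request options for Qwen3's hybrid thinking modes.
// `enable_thinking` is named on the model card; `thinking_budget` is an
// assumed key name and may differ per provider.
interface ThinkingOptions {
  enable_thinking: boolean
  thinking_budget?: number // max tokens allowed for the reasoning trace
}

function thinkingOptions(complex: boolean, budget = 2048): ThinkingOptions {
  // Route complex queries through chain-of-thought; keep routine ones fast.
  return complex
    ? { enable_thinking: true, thinking_budget: budget }
    : { enable_thinking: false }
}

// These options would typically travel in the request body or through an
// SDK's provider-options escape hatch, alongside `model` and `prompt`.
console.log(thinkingOptions(true))  // { enable_thinking: true, thinking_budget: 2048 }
console.log(thinkingOptions(false)) // { enable_thinking: false }
```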

The 30B-A3B supports 119 languages and dialects, matching the multilingual coverage of the rest of the Qwen3 family, and includes tool calling, agentic workflows, and MCP support, giving the model strong instruction-following and coding capabilities relative to its inference cost.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Context | Latency | Throughput | Input | Output | Release Date |
|---|---|---|---|---|---|---|
| DeepInfra (Legal: Terms, Privacy) | 41K | 0.2s | 62 tps | $0.08/M | $0.29/M | 04/01/2025 |
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Alibaba

| Model | Context | Latency | Throughput | Input | Output | Cache Read | Cache Write | Providers | Release Date |
|---|---|---|---|---|---|---|---|---|---|
| | 240K | 1.7s | 82 tps | $1.30/M | $7.80/M | $0.26/M | $1.63/M | Alibaba | 04/20/2026 |
| | 1M | 1.0s | 55 tps | $0.50/M | $3.00/M | $0.10/M | $0.63/M | Alibaba, Fireworks | 04/02/2026 |
| | 1M | 1.1s | 284 tps | $0.10/M | $0.40/M | $0.00/M | $0.13/M | Alibaba | 02/24/2026 |
| | 1M | 2.3s | 55 tps | $0.40/M | $2.40/M | $0.04/M | $0.50/M | Alibaba | 02/16/2026 |
| | 256K | 0.2s | 143 tps | $0.50/M | $1.20/M | | | Bedrock, Together AI | 07/22/2025 |
| | 33K | | | $0.02/M | | | | DeepInfra | 06/05/2025 |

What To Consider When Choosing a Provider

  • Configuration: Provider selection is most consequential when your application has region-specific latency requirements or data handling policies that point to particular infrastructure.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use Qwen3-30B-A3B

Best For

  • High-volume inference where token cost matters: Activating only 3B parameters per request makes this model economical to run at scale. Applications processing thousands of requests per hour benefit from the efficiency gap over fully-dense alternatives at similar quality levels
  • Reasoning tasks that previously required larger models: If your workload was reaching for Qwen2.5-32B or QwQ-32B, the 30B-A3B delivers comparable or better results with significantly lower serving costs
  • Applications with variable complexity: The hybrid thinking mode is particularly useful here; route complex queries through thinking mode and simpler ones through non-thinking mode, keeping costs proportional to actual task difficulty
  • Production deployments requiring predictable throughput: MoE models with small active parameter counts tend to be faster to serve than dense models of comparable benchmark performance, which helps when maintaining consistent response latency under load
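One way to keep costs proportional to task difficulty is a cheap heuristic router in front of the model. A minimal sketch follows; the keyword list and length threshold are illustrative, not part of any API:

```typescript
// Sketch: a heuristic pre-router that decides whether a request is worth
// the extra latency and tokens of thinking mode. Thresholds are illustrative.
const REASONING_HINTS = ['prove', 'derive', 'step by step', 'debug', 'optimize']

function needsThinking(prompt: string): boolean {
  const p = prompt.toLowerCase()
  const hasHint = REASONING_HINTS.some((h) => p.includes(h))
  const isLong = p.split(/\s+/).length > 120 // long prompts often mean multi-part tasks
  return hasHint || isLong
}

console.log(needsThinking('Classify this review as positive or negative.')) // false
console.log(needsThinking('Derive the closed form and prove it converges.')) // true
```

In production this heuristic would gate which mode (and which thinking budget) a request gets before it reaches the model.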

Consider Alternatives When

  • You need maximum reasoning headroom: For the most demanding tasks, Qwen3-235B-A22B, the larger MoE in the family, offers a higher capability ceiling than the efficiency-focused 30B-A3B
  • You're comparing against genuinely tiny models for simple tasks: If your application primarily handles simple classification, short-form generation, or keyword extraction, even smaller models may provide adequate quality at lower cost
  • Multimodal input processing is required: Qwen3-30B-A3B handles text only

Conclusion

Qwen3-30B-A3B delivers strong reasoning performance without the serving costs of large dense models. It outperforms QwQ-32B while activating roughly one-tenth the parameters per token, fitting well into high-throughput applications where quality and efficiency need to coexist. AI Gateway adds reliable provider failover and a single billing integration.

Frequently Asked Questions

  • How is it possible for a 3B-active-parameter model to outperform QwQ-32B?

    The mixture-of-experts architecture separates total parameter count from inference compute. At inference, routing selects the most relevant 3 billion parameters for each token, so the model benefits from the broad capacity of its 30 billion total parameters while keeping serving costs proportional to the 3B active count. QwQ-32B, by contrast, activates all 32 billion of its parameters on every token, paying full dense compute for a similar total capacity.
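    The routing step can be illustrated with a toy top-k gate. This is a conceptual sketch, not Qwen3's actual router; real MoE layers apply a learned gate per token, per layer:

    ```typescript
    // Toy mixture-of-experts gate: score every expert, keep only the top-k.
    // This just illustrates why only a fraction of total parameters is
    // active for any given token.
    function softmax(xs: number[]): number[] {
      const m = Math.max(...xs)
      const exps = xs.map((x) => Math.exp(x - m))
      const sum = exps.reduce((a, b) => a + b, 0)
      return exps.map((e) => e / sum)
    }

    function topKExperts(scores: number[], k: number): { index: number; weight: number }[] {
      const ranked = softmax(scores)
        .map((weight, index) => ({ index, weight }))
        .sort((a, b) => b.weight - a.weight)
        .slice(0, k)
      // Renormalize so the selected experts' weights sum to 1.
      const total = ranked.reduce((a, e) => a + e.weight, 0)
      return ranked.map((e) => ({ index: e.index, weight: e.weight / total }))
    }

    // 8 experts, but only 2 contribute to this token's output.
    console.log(topKExperts([0.1, 2.0, -1.0, 0.5, 1.7, -0.3, 0.0, 0.9], 2))
    ```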

  • What does "A3B" mean in the model name?

    "A3B" indicates that 3 billion parameters are activated during inference (A = activated, 3B = 3 billion). The "30B" is the total parameter count across all expert layers.

  • How does the 30B-A3B architecture affect serving cost?

    At inference, only 3 billion parameters activate per token, so per-token compute is comparable to a 3B dense model even though the full MoE has 30 billion parameters. This is the source of the cost advantage over dense 32B-class models at similar quality.
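    The back-of-the-envelope comparison is simple arithmetic on active parameter counts (real serving cost also depends on memory footprint and batching, which the full 30B weights still influence):

    ```typescript
    // Active-parameter ratio: per-token FLOPs scale roughly with active params.
    const qwen3ActiveB = 3 // Qwen3-30B-A3B activates ~3B parameters per token
    const qwqActiveB = 32  // QwQ-32B is dense: all 32B active every token

    const ratio = qwen3ActiveB / qwqActiveB
    console.log(ratio.toFixed(3)) // 0.094, i.e. roughly one-tenth the per-token compute
    ```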

  • How does the thinking budget control work in practice?

    You set a token budget for the thinking trace via the API. Higher budgets allow the model to explore more reasoning steps before producing its answer. Lower budgets constrain the reasoning phase and produce faster responses, which is useful when a question is straightforward and extended reasoning wouldn't add value.

  • Does Qwen3-30B-A3B support the same 119 languages as other Qwen3 models?

    Yes. The 119-language coverage applies across the Qwen3 family, including this model.

  • What agentic use cases is this model suited for?

    The model supports tool calling and MCP (Model Context Protocol). It fits automated workflows where the model needs to select and invoke tools across multiple steps, particularly in cost-sensitive deployments where running a larger model per agent step would be prohibitive.
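    The host side of that loop can be sketched without any SDK: the model emits a tool call, the host dispatches it, and the result is fed back into the conversation. The `getWeather` tool and its shape here are hypothetical:

    ```typescript
    // Sketch of a host-side tool dispatcher. `getWeather` is a hypothetical
    // tool; in production, the model's tool-call output drives this loop
    // across multiple agent steps.
    type ToolCall = { name: string; args: Record<string, unknown> }

    const tools: Record<string, (args: Record<string, unknown>) => string> = {
      getWeather: (args) => `Sunny in ${args.city}`, // stand-in for a real API call
    }

    function dispatch(call: ToolCall): string {
      const tool = tools[call.name]
      if (!tool) throw new Error(`Unknown tool: ${call.name}`)
      return tool(call.args)
    }

    // A tool call as the model might emit it; the result goes back as a tool message.
    const result = dispatch({ name: 'getWeather', args: { city: 'Berlin' } })
    console.log(result) // "Sunny in Berlin"
    ```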

  • How does AI Gateway route requests for this model?

    AI Gateway selects among the available providers for this model (currently DeepInfra) based on availability and performance. If a provider returns an error or is slow to respond, the request automatically retries with another provider in the pool, so your application doesn't need to implement its own retry logic.
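    The retry behavior described above resembles a fallback loop like this simplified sketch; the gateway's actual routing also weighs latency and availability, and the stub providers here are purely illustrative:

    ```typescript
    // Sketch: try providers in preference order, falling through on failure.
    type Provider = { slug: string; call: (prompt: string) => Promise<string> }

    async function withFailover(providers: Provider[], prompt: string): Promise<string> {
      let lastError: unknown
      for (const p of providers) {
        try {
          return await p.call(prompt)
        } catch (err) {
          lastError = err // record and fall through to the next provider
        }
      }
      throw new Error(`All providers failed: ${String(lastError)}`)
    }

    // Demo with stub providers: the first fails, the second answers.
    const pool: Provider[] = [
      { slug: 'flaky', call: async () => { throw new Error('timeout') } },
      { slug: 'deepinfra', call: async (p) => `echo: ${p}` },
    ]
    withFailover(pool, 'hi').then(console.log) // "echo: hi"
    ```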