
Qwen3-30B-A3B

alibaba/qwen-3-30b

Qwen3-30B-A3B is a mixture-of-experts model from Alibaba that activates only 3 billion of its 30 billion parameters per inference, outperforming QwQ-32B while running at a fraction of the compute cost.

Reasoning · Tool Use
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen-3-30b',
  prompt: 'Why is the sky blue?',
})

// Consume the stream as tokens arrive
for await (const text of result.textStream) {
  process.stdout.write(text)
}

What To Consider When Choosing a Provider

  • Zero Data Retention

    AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.

    Authentication

    AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
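As a sketch, the key is typically supplied through an environment variable rather than in code (AI_GATEWAY_API_KEY is the AI SDK's conventional name; on Vercel deployments an OIDC token can be provisioned instead):

```shell
# Supply the gateway key via the environment; the AI SDK reads it
# automatically, so no provider credentials appear in application code.
export AI_GATEWAY_API_KEY="your-gateway-key"
```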

Provider selection is most consequential when your application has region-specific latency requirements or data handling policies that point to particular infrastructure.

When to Use Qwen3-30B-A3B

Best For

  • High-volume inference where token cost matters:

    Activating only 3B parameters per request makes this model economical to run at scale. Applications processing thousands of requests per hour benefit from the efficiency gap over fully dense alternatives at similar quality levels.

  • Reasoning tasks that previously required larger models:

    If your workload was reaching for Qwen2.5-32B or QwQ-32B, the 30B-A3B delivers comparable or better results at significantly lower serving cost.

  • Applications with variable complexity:

    The hybrid thinking mode is particularly useful here: route complex queries through thinking mode and simpler ones through non-thinking mode, keeping costs proportional to actual task difficulty.

  • Production deployments requiring predictable throughput:

    MoE models with small active parameter counts tend to be faster to serve than dense models of comparable benchmark performance, which helps maintain consistent response latency under load.
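That routing idea can be sketched as a small helper. The complexity heuristic and the enableThinking flag below are illustrative assumptions, not documented gateway options:

```typescript
// Decide per request whether to pay for a thinking trace.
// The heuristic (prompt length plus reasoning keywords) is a simple
// stand-in for a real complexity classifier.
type RoutingDecision = { enableThinking: boolean }

function routeQuery(prompt: string): RoutingDecision {
  const reasoningHints = /\b(prove|derive|debug|step by step|multi-step)\b/i
  const isComplex = prompt.length > 200 || reasoningHints.test(prompt)
  return { enableThinking: isComplex }
}
```

Requests flagged complex would then run in thinking mode; everything else takes the cheaper non-thinking path.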

Consider Alternatives When

  • You need maximum reasoning headroom:

    For the most demanding tasks, Qwen3-235B-A22B offers a higher capability ceiling; the 30B-A3B trades some reasoning headroom for efficiency.

  • You're comparing against genuinely tiny models for simple tasks:

    If your application primarily handles simple classification, short-form generation, or keyword extraction, an even smaller model may provide adequate quality at lower cost.

  • Multimodal input processing is required:

    Qwen3-30B-A3B accepts text input only.

Conclusion

Qwen3-30B-A3B delivers strong reasoning performance without the serving costs of large dense models. It outperforms QwQ-32B while activating one-tenth the parameters per token, fitting well into high-throughput applications where quality and efficiency need to coexist. AI Gateway adds automatic retries and failover across its provider pool (currently deepinfra) plus a single billing integration.

FAQ

How does a 30B MoE model outperform the dense QwQ-32B?

The mixture-of-experts architecture separates total parameter count from inference compute. At inference, routing selects the most relevant 3 billion parameters for each token. The model benefits from the broad capacity of its 30 billion total parameters while keeping serving costs proportional to the 3B active count. QwQ-32B activates all 32 billion parameters on every token, paying full dense compute without the specialization benefits of expert routing.

What does the "A3B" suffix mean?

"A3B" indicates that 3 billion parameters are activated during inference (A = activated, 3B = 3 billion). The "30B" is the total parameter count across all expert layers.

Where does the cost advantage come from?

At inference, only 3 billion parameters activate per token, so per-token compute is comparable to a 3B dense model even though the full MoE has 30 billion parameters. This is the source of the cost advantage over dense 32B-class models at similar quality.
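The arithmetic behind that claim, as a quick sanity check (the compute ratio is a rough proportional estimate, not a measured benchmark):

```typescript
// Per-token compute scales with active parameters, not total ones.
const totalParams = 30e9    // all experts combined
const activeParams = 3e9    // parameters routed per token
const denseBaseline = 32e9  // QwQ-32B uses every parameter per token

const activeFraction = activeParams / totalParams   // 0.1
const computeRatio = denseBaseline / activeParams   // ~10.7x less per-token compute

console.log({ activeFraction, computeRatio })
```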

How is the thinking budget controlled?

You set a token budget for the thinking trace via the API. Higher budgets let the model explore more reasoning steps before producing its answer; lower budgets constrain the reasoning phase for faster responses, which is useful when a question is straightforward and extended reasoning wouldn't add value.

Does Qwen3-30B-A3B support languages other than English?

Yes. The 119-language coverage applies across the Qwen3 family, including this model.

Can it be used for agentic tool use?

The model supports tool calling and MCP (Model Context Protocol). It fits automated workflows where the model needs to select and invoke tools across multiple steps, particularly in cost-sensitive deployments where running a larger model per agent step would be prohibitive.
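The multi-step pattern reduces to the model emitting tool calls and the application dispatching them. A minimal, SDK-independent sketch (the tool names and shapes are invented for illustration):

```typescript
// A model-issued tool call, as it might arrive in a response
type ToolCall = { name: string; args: Record<string, unknown> }

// Local implementations the model is allowed to invoke
const tools: Record<string, (args: any) => unknown> = {
  getWeather: ({ city }: { city: string }) => ({ city, tempC: 21 }),
}

// Dispatch one call and return its result for the next model step
function dispatch(call: ToolCall): unknown {
  const fn = tools[call.name]
  if (!fn) throw new Error(`unknown tool: ${call.name}`)
  return fn(call.args)
}
```

In a real agent loop, the result is fed back to the model, which decides whether to call another tool or produce the final answer.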

What happens if a provider fails?

AI Gateway routes requests to available providers (currently deepinfra) based on availability and performance. If a provider returns an error or is slow to respond, requests automatically retry with another provider in the pool, so your application doesn't need to implement retry logic independently.