
Qwen3 VL 235B A22B Thinking

Qwen3 VL 235B A22B Thinking is the reasoning-specialized edition of Alibaba's Qwen3-VL vision-language model, combining a multimodal context of 131.1K tokens with extended chain-of-thought traces for STEM reasoning, mathematical problem solving, and compositional visual analysis.

Capabilities: Vision (Image) · Reasoning · Tool Use · File Input
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen3-vl-thinking',
  prompt: 'Why is the sky blue?',
})

// Print the answer as it streams in.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}

Playground

Try out Qwen3 VL 235B A22B Thinking by Alibaba. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Qwen3 VL 235B A22B Thinking

Qwen3 VL 235B A22B Thinking is the reasoning-specialized counterpart to Qwen3-VL-Instruct. It shares the same foundational architecture, including DeepStack multi-level ViT fusion, enhanced interleaved MRoPE for spatial-temporal modeling, and text-based temporal alignment for video, but is tuned for long-horizon compositional reasoning. When Qwen3 VL 235B A22B Thinking encounters a question involving fine-grained visual detail, multi-step mathematical derivation, or causal inference across a sequence of images or video frames, it generates a visible chain-of-thought trace before committing to a final answer.
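
As a sketch of what this looks like with the AI SDK call shown above: the full stream interleaves reasoning deltas with answer text, so the two can be handled separately. The part type and property names below ('reasoning-delta', 'text-delta', part.text) follow AI SDK 5 conventions and are assumptions to verify against your installed version.

import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen3-vl-thinking',
  prompt: 'What is the sum of the first 50 odd numbers?',
})

// Route the chain-of-thought trace and the final answer separately.
// Part names follow AI SDK 5 ('reasoning-delta' / 'text-delta'); older
// versions use different type and property names.
for await (const part of result.fullStream) {
  if (part.type === 'reasoning-delta') process.stderr.write(part.text)
  else if (part.type === 'text-delta') process.stdout.write(part.text)
}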

This reasoning orientation makes the Thinking variant particularly well-suited for STEM domains. Mathematical diagrams, physics problems presented with visual components, and scientific charts all benefit from a model that notices fine visual details (scale markings, axis labels, geometric relationships) and reasons about them systematically before producing a response. Qwen3 VL 235B A22B Thinking also applies compositional reasoning to multi-image inputs: comparing experimental results across several scatter plots, for example, or identifying a trend across a sequence of microscopy images, cases where the answer can't be derived from any single visual element in isolation.

Like its Instruct counterpart, the Thinking variant operates over a context window of 131.1K tokens that accommodates interleaved text, images, and video. This combination of long multimodal context and extended reasoning depth enables applications such as detailed long-form video analysis where the model must both track temporal events and reason carefully about their relationships, or complex document-plus-figure analysis where diagrams and text must be jointly interpreted.
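
For illustration, a multi-image comparative request can be expressed as a single user message carrying interleaved text and image parts. The content-part shape ({ type: 'image', image }) is the AI SDK message format; the image URLs are placeholders.

import { streamText } from 'ai'

// A minimal multi-image comparison sketch. The URLs are placeholders.
const result = streamText({
  model: 'alibaba/qwen3-vl-thinking',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Which of these two scatter plots shows the stronger correlation, and why?',
        },
        { type: 'image', image: new URL('https://example.com/run-a.png') },
        { type: 'image', image: new URL('https://example.com/run-b.png') },
      ],
    },
  ],
})

console.log(await result.text)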

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider    Context  Latency  Throughput  Input    Output   Release Date
Alibaba     131K     0.7s     74 tps      $0.40/M  $4.00/M  09/24/2025
Novita AI   131K     0.9s     97 tps      $0.98/M  $3.95/M  09/24/2025

Neither provider lists cache pricing, per-query web search pricing, ZDR, or a no-training flag for this model. Each provider's Terms and Privacy policy are linked under Legal.
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.


What To Consider When Choosing a Provider

  • Configuration: Thinking-mode multimodal responses can be long; confirm that your application's timeout configuration and streaming implementation handle extended generation sequences correctly before deploying to production (see the sketch after this list)
  • Zero Data Retention: AI Gateway does not currently support Zero Data Retention for this model. See the documentation for models that support ZDR.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
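
As a minimal sketch of the timeout point above, reusing the AI SDK call shape from earlier on this page: abortSignal is a standard streamText option, and the 10-minute budget below is an illustrative assumption, not a documented recommendation.

import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen3-vl-thinking',
  prompt: 'Work through this step by step: which assumptions break if the sample size doubles?',
  // Thinking traces can run long; budget generously. The 10-minute
  // value here is an assumed starting point, not a documented default.
  abortSignal: AbortSignal.timeout(10 * 60 * 1000),
})

for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}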

When to Use Qwen3 VL 235B A22B Thinking

Best For

  • Visual STEM problem solving: Physics diagrams, geometry figures, and chemistry structural formulas that combine visual input with mathematical or scientific reasoning
  • Mathematical visual benchmarks: Tasks such as MathVista or MathVision where step-by-step derivation improves final accuracy
  • Multi-image comparative analysis: Reasoning across several images simultaneously to reach a single conclusion
  • Educational and tutoring applications: A visible reasoning chain helps learners understand how a visual problem is solved
  • Scientific figure analysis: Research workflows that reason over data visualizations or microscopy images at expert-level detail

Consider Alternatives When

  • Basic visual instruction following: Use Qwen3-VL-Instruct for faster, lower-cost responses when extended reasoning is unnecessary
  • Tight token and latency budgets: Thinking traces significantly increase both token usage and response time
  • Text-only workloads: A text-only reasoning model is more cost-efficient when there's no visual content
  • Simple OCR or extraction: Document extraction and GUI automation tasks don't require reasoning traces

Conclusion

Qwen3 VL 235B A22B Thinking brings extended chain-of-thought reasoning to multimodal inputs, a combination that is particularly valuable for visual STEM problems and compositional analysis tasks where surface-level pattern matching is insufficient. For teams building applications that must explain their visual reasoning or solve structured problems embedded in images and video, it provides a distinct capability over direct-answer vision models.

Frequently Asked Questions

  • What makes Qwen3 VL 235B A22B Thinking different from Qwen3-VL-Instruct?

    The Thinking variant is trained to produce extended chain-of-thought reasoning traces before its final answer. This improves accuracy on complex, multi-step visual problems but increases output token count and response time compared to the Instruct variant.

  • What kinds of visual STEM tasks benefit most from the thinking mode?

    Problems that require reading numerical values from diagrams, applying formulas based on geometric relationships, interpreting multi-axis scientific charts, or reasoning about causality across multiple images benefit the most from step-by-step visual reasoning.

  • Does the model support the same modalities as the Instruct variant?

    Yes. Both variants accept interleaved text, images, and video within a context window of 131.1K tokens. The difference is in the reasoning depth of the response, not the supported input types.

  • How does DeepStack improve reasoning accuracy on visual inputs?

    DeepStack fuses feature maps from multiple Vision Transformer depth levels, combining coarse and fine-grained visual representations, so the language model has richer input when constructing a reasoning chain. This is especially valuable for tasks requiring precise spatial measurement or small-detail recognition within an image.

  • Can the Thinking variant handle long video inputs that require temporal reasoning?

    Yes. Text-based temporal alignment grounds the model's understanding of when events occur in a video using explicit timestamp markers. Combined with the multimodal context window of 131.1K tokens, the model can reason about event sequences across extended video without losing temporal reference. A sketch of passing a video file as input appears after this FAQ.

  • What benchmarks has this model been evaluated on?

    The Qwen3-VL family reports strong benchmark scores on MMMU, MathVista, MathVision, and MMBench (the 235B-A22B model scored 89.3/88.9 on MMBench and 79.2 on RealWorldQA). Specific thinking-variant scores should be verified against Qwen's published technical report at https://arxiv.org/abs/2511.21631.

  • How should I set timeouts for this model in production?

    Thinking-mode completions for complex visual reasoning problems can generate thousands of reasoning tokens before the final answer. Set your HTTP and streaming timeouts to accommodate generation times that may be several times longer than a comparable direct-answer request.
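
As referenced in the video question above, here is a hedged sketch of passing a video file as input. The file content-part shape ({ type: 'file', data, mediaType }) follows AI SDK 5 conventions (older versions used 'mimeType'); the filename and prompt are placeholders, and whether a given provider accepts video input for this model should be confirmed in its documentation.

import { readFile } from 'node:fs/promises'
import { streamText } from 'ai'

// Placeholder path; any container the provider accepts would work.
const video = await readFile('./experiment.mp4')

const result = streamText({
  model: 'alibaba/qwen3-vl-thinking',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'At what timestamp does the liquid change color, and what happens just before?',
        },
        // File-part shape follows AI SDK 5 ('mediaType'); confirm video
        // support with your chosen provider before relying on it.
        { type: 'file', data: video, mediaType: 'video/mp4' },
      ],
    },
  ],
})

console.log(await result.text)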