Qwen3 VL 235B A22B Thinking is the reasoning-specialized counterpart to Qwen3-VL-Instruct. It shares the same foundational architecture, including DeepStack multi-level ViT fusion, enhanced interleaved MRoPE for spatial-temporal modeling, and text-based temporal alignment for video, but is tuned for long-horizon compositional reasoning. When Qwen3 VL 235B A22B Thinking encounters a question involving fine-grained visual detail, multi-step mathematical derivation, or causal inference across a sequence of images or video frames, it generates a visible chain-of-thought trace before committing to a final answer.
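Consuming that output downstream usually means separating the reasoning trace from the final answer. The sketch below assumes the trace is delimited by `<think>...</think>` tags, as in other Qwen3 thinking variants; the exact delimiter may differ by deployment.

```python
# Sketch: separating the visible chain-of-thought trace from the final
# answer. Assumes the model wraps its reasoning in <think>...</think>
# tags (as other Qwen3 thinking variants do); verify against your stack.

def split_thinking(output: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from raw model output."""
    open_tag, close_tag = "<think>", "</think>"
    start = output.find(open_tag)
    end = output.find(close_tag)
    if start == -1 or end == -1:
        return "", output.strip()  # no trace found; treat all as answer
    trace = output[start + len(open_tag):end].strip()
    answer = output[end + len(close_tag):].strip()
    return trace, answer

raw = "<think>The axis is log-scaled, so the jump is 10x.</think>The increase is tenfold."
trace, answer = split_thinking(raw)
```

In practice you might log or display the trace separately while returning only the answer to end users.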
This reasoning orientation makes the Thinking variant particularly well-suited for STEM domains. Mathematical diagrams, physics problems presented with visual components, and scientific charts all benefit from a model that notices fine visual details (scale markings, axis labels, geometric relationships) and reasons about them systematically before producing a response. Qwen3 VL 235B A22B Thinking also applies compositional reasoning to multi-image inputs: comparing experimental results across several scatter plots, for example, or identifying a trend across a sequence of microscopy images, cases where the answer cannot be derived from any single visual element in isolation.
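A multi-image comparison like the scatter-plot example above is typically sent as one user turn interleaving several images with a question. The sketch below uses the OpenAI-style chat content format that Qwen-VL serving stacks commonly accept; the field names (`image_url`, etc.) vary by deployment and are illustrative rather than canonical.

```python
# Sketch: building a multi-image request in the OpenAI-style chat format
# commonly accepted by Qwen-VL serving stacks. Field names are an
# assumption; check your deployment's API reference.

def compare_plots_message(image_urls: list[str], question: str) -> dict:
    """Build one user turn interleaving several images with a question."""
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

msg = compare_plots_message(
    ["https://example.com/run1.png", "https://example.com/run2.png"],
    "Which run shows the stronger correlation, and why?",
)
```

Placing all images in the same turn, ahead of the question, lets the model attend across them jointly rather than answering per image.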
Like its Instruct counterpart, the Thinking variant operates over a 131.1K-token context window that accommodates interleaved text, images, and video. This combination of long multimodal context and extended reasoning depth enables applications such as long-form video analysis, where the model must both track temporal events and reason carefully about their relationships, and complex document-plus-figure analysis, where diagrams and surrounding text must be interpreted together.
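When packing long interleaved inputs into that window, a rough token budget check helps avoid truncation. The per-image and per-frame costs below are placeholder assumptions (real costs depend on resolution and the processor's configuration), and the window size is taken as 131,072 tokens (~131.1K).

```python
# Sketch: a rough budget check for fitting interleaved inputs into the
# ~131.1K-token context window. TOKENS_PER_IMAGE and
# TOKENS_PER_VIDEO_FRAME are assumed placeholders, not measured values.

CONTEXT_WINDOW = 131_072       # ~131.1K tokens
TOKENS_PER_IMAGE = 1_200       # assumed average; varies with resolution
TOKENS_PER_VIDEO_FRAME = 300   # assumed; depends on frame sampling/size

def fits_in_context(text_tokens: int, n_images: int, n_frames: int,
                    reserve_for_output: int = 8_192) -> bool:
    """Estimate whether an interleaved request fits, leaving headroom
    for the reasoning trace and final answer."""
    used = (text_tokens
            + n_images * TOKENS_PER_IMAGE
            + n_frames * TOKENS_PER_VIDEO_FRAME)
    return used + reserve_for_output <= CONTEXT_WINDOW
```

Reserving output headroom matters more for a thinking model than an instruct model, since the visible reasoning trace consumes generation tokens before the answer appears.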