
Qwen3 VL 235B A22B Instruct

Qwen3 VL 235B A22B Instruct is Alibaba's multimodal vision-language model supporting interleaved text, images, and video over a native context of 262.1K tokens, with architectural improvements in spatial-temporal modeling and agentic GUI interaction.

Vision (Image)
index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'alibaba/qwen3-vl-instruct',
  prompt: 'Why is the sky blue?',
})

// Print the streamed text as it arrives.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}

Playground

Try out Qwen3 VL 235B A22B Instruct by Alibaba. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Qwen3 VL 235B A22B Instruct

Qwen3 VL 235B A22B Instruct is the general-purpose variant in the Qwen3-VL model family, built on a mixture-of-experts (MoE) architecture with 235 billion total parameters and approximately 22 billion active per token. Its context window of 262.1K tokens accommodates interleaved sequences of text, images, and video frames, making it practical for reasoning across large multimodal documents without segmenting input.
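In practice, interleaving means a single user message can mix text and image parts. Here is a minimal sketch using the AI SDK's message format (the file path and prompt are illustrative placeholders):

import { generateText } from 'ai'
import { readFileSync } from 'node:fs'

// Illustrative only: 'chart.png' is a placeholder path, not a real asset.
const { text } = await generateText({
  model: 'alibaba/qwen3-vl-instruct',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Summarize the trend shown in this chart.' },
        { type: 'image', image: readFileSync('chart.png') },
      ],
    },
  ],
})

console.log(text)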

Three architectural innovations distinguish Qwen3-VL from prior generations. Enhanced interleaved Multimodal Rotary Position Embedding (MRoPE) improves spatial and temporal modeling across visual inputs, giving Qwen3 VL 235B A22B Instruct a stronger sense of object positions within images and event ordering within video. DeepStack integration fuses multi-level Vision Transformer (ViT) features from shallow, middle, and deep layers to tighten alignment between visual tokens and language tokens, improving grounding precision. Text-based temporal alignment for video replaces the prior T-RoPE approach with explicit textual timestamp grounding, enabling more reliable event localization within long video sequences.

Qwen3 VL 235B A22B Instruct extends its vision capabilities to agentic scenarios: it can parse GUI screenshots, understand layout and interactive elements, and plan actions for PC or mobile automation workflows. Optical character recognition (OCR) covers 32 languages and handles challenging conditions including low light, blurred text, and tilted documents. On standard multimodal benchmarks including MMMU and visual-math evaluations (MathVista, MathVision), Qwen3 VL 235B A22B Instruct reports competitive results against other frontier vision-language models.
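For the GUI scenario, a common pattern is to request a structured action plan rather than free text. The sketch below uses the AI SDK's generateObject with a hypothetical zod schema; the action vocabulary and screenshot path are assumptions for illustration, not a documented Qwen interface:

import { generateObject } from 'ai'
import { readFileSync } from 'node:fs'
import { z } from 'zod'

// Hypothetical action schema; a real agent would define actions
// to match whatever execution layer carries them out.
const { object: plan } = await generateObject({
  model: 'alibaba/qwen3-vl-instruct',
  schema: z.object({
    steps: z.array(
      z.object({
        action: z.enum(['click', 'type', 'scroll']),
        target: z.string(), // element description, e.g. 'Sign in button'
        value: z.string().optional(), // text to type, if any
      }),
    ),
  }),
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Plan the steps to log in on this screen.' },
        { type: 'image', image: readFileSync('screenshot.png') },
      ],
    },
  ],
})

console.log(plan.steps)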

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
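As a sketch of setting that preference per request, the gateway accepts a provider order via providerOptions; the slugs below are examples taken from the table that follows, so confirm the exact option shape in the docs:

import { streamText } from 'ai'

// Prefer DeepInfra, then fall back to Novita AI.
const result = streamText({
  model: 'alibaba/qwen3-vl-instruct',
  prompt: 'Why is the sky blue?',
  providerOptions: {
    gateway: { order: ['deepinfra', 'novita'] },
  },
})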

| Provider | Context | Latency | Throughput | Input | Output | Cache | Release Date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Alibaba | 131K | 0.9s | 32 tps | $0.40/M | $1.60/M | | 09/24/2025 |
| Novita AI | 131K | 0.7s | 34 tps | $0.30/M | $1.50/M | | 09/24/2025 |
| DeepInfra | 262K | 0.4s | 33 tps | $0.20/M | $0.88/M | Read: $0.11/M | 09/24/2025 |
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. Visit the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.
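These published figures are P50s over live traffic; a rough client-side check of TTFT and streaming rate can be sketched as follows (characters per second here only approximates tokens per second):

import { streamText } from 'ai'

const start = performance.now()
let ttftMs: number | undefined
let chars = 0

const result = streamText({
  model: 'alibaba/qwen3-vl-instruct',
  prompt: 'Explain rotary position embeddings in one paragraph.',
})

for await (const chunk of result.textStream) {
  ttftMs ??= performance.now() - start // time to first token, in ms
  chars += chunk.length
}

const seconds = (performance.now() - start) / 1000
console.log({ ttftMs, approxCharsPerSecond: chars / seconds })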

More models by Alibaba

| Context | Latency | Throughput | Input | Output | Cache | Providers | Release Date |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 240K | 3.0s | 57 tps | $1.30/M | $7.80/M | Read: $0.26/M, Write: $1.63/M | Alibaba | 04/20/2026 |
| 1M | 4.1s | 55 tps | $0.50/M | $3.00/M | Read: $0.10/M, Write: $0.63/M | Alibaba, Fireworks | 04/02/2026 |
| 1M | 0.9s | 250 tps | $0.10/M | $0.40/M | Read: $0.00/M, Write: $0.13/M | Alibaba | 02/24/2026 |
| 1M | 1.3s | 110 tps | $0.40/M | $2.40/M | Read: $0.04/M, Write: $0.50/M | Alibaba | 02/16/2026 |
| 256K | 0.2s | 169 tps | $0.50/M | $1.20/M | | Bedrock, Together AI | 07/22/2025 |
| 262K | 0.3s | 79 tps | $0.22/M | $1.80/M | Read: $0.02/M | Alibaba, DeepInfra, Novita, +1 | 04/01/2025 |

What To Consider When Choosing a Provider

  • Configuration: For applications that process video or multi-image inputs, confirm that your selected provider's serving infrastructure supports large multimodal payloads at your target throughput before routing production traffic.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token, so you do not need to manage provider credentials directly; a minimal setup sketch follows this list.
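A minimal sketch of explicit key wiring, assuming the @ai-sdk/gateway package and the AI_GATEWAY_API_KEY environment variable; by default the SDK reads the key from the environment, so this is only needed for custom setups:

import { createGateway } from '@ai-sdk/gateway'
import { streamText } from 'ai'

// Assumption: the key is provided via the AI_GATEWAY_API_KEY env var.
const gateway = createGateway({
  apiKey: process.env.AI_GATEWAY_API_KEY,
})

const result = streamText({
  model: gateway('alibaba/qwen3-vl-instruct'),
  prompt: 'Why is the sky blue?',
})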

When to Use Qwen3 VL 235B A22B Instruct

Best For

  • Multilingual document intelligence: Pipelines that require OCR across 32 languages under varied image quality conditions
  • GUI automation and screen reading: Agents that interpret application screenshots to plan and execute UI actions
  • Long video comprehension: Tasks that need precise event localization and temporal reasoning over extended sequences
  • Multi-image analysis: Comparing product photographs, reviewing multiple chart pages, or cross-referencing figures across a document
  • Spatial grounding: Reasoning tasks that require accurate 2D or 3D grounding of objects within images

Consider Alternatives When

  • Extended visual reasoning traces: Consider Qwen3-VL-Thinking when STEM and compositional visual reasoning tasks require step-by-step traces
  • Text-only workloads: A text-only model will provide lower cost and faster throughput when vision isn't used
  • Latency-critical basic tasks: Simple instruction following without complex visual analysis doesn't need this model's scale

Conclusion

Qwen3 VL 235B A22B Instruct is a capable general-purpose vision-language model for production workflows that mix text with images, video, and documents. Its architectural improvements in spatial-temporal modeling and GUI-reading make it broadly applicable across document processing, video analysis, and screen-based automation, while the multimodal context window of 262.1K tokens accommodates inputs that would otherwise require splitting.

Frequently Asked Questions

  • What modalities does Qwen3 VL 235B A22B Instruct accept?

    The model accepts interleaved sequences of text, images, and video frames within a single context window of up to 262.1K tokens.

  • How does DeepStack improve vision-language alignment?

    DeepStack fuses feature maps from multiple depth levels of the Vision Transformer: shallow layers capture low-level detail, while deeper layers encode abstract semantics. Combining these gives the language model richer visual grounding information than single-layer ViT representations (a toy fusion sketch follows this FAQ).

  • What is the difference between MRoPE and standard positional encoding for video?

    Enhanced interleaved MRoPE assigns distinct positional axes to the temporal (time), height, and width dimensions of video inputs, giving the model an explicit spatial-temporal coordinate system. This improves its ability to reason about where and when events occur within a video sequence (a coordinate sketch follows this FAQ).

  • Can this model perform GUI automation tasks?

    Yes. The model is trained to parse GUI screenshots, identify interactive elements (buttons, forms, navigation), and plan multi-step actions for PC or mobile application automation.

  • What OCR languages and conditions are supported?

    The model covers OCR in 32 languages and has been evaluated for robustness in low-light, blurred, and tilted-text conditions.

  • How does Qwen3 VL 235B A22B Instruct differ from Qwen3-VL-Thinking?

    The Instruct variant is designed for direct instruction following and is generally faster and more cost-effective. The Thinking variant adds extended step-by-step reasoning traces optimized for complex STEM and compositional visual reasoning problems.

  • Is the context window of 262.1K tokens shared across text, image, and video tokens together?

    Yes. The limit of 262.1K tokens applies to the combined sequence of text tokens and visual tokens (image patches, video frames encoded as tokens) in an interleaved context.
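On the DeepStack question above, the fusion idea can be caricatured in a few lines. This is purely illustrative and not the actual Qwen3-VL implementation: each visual token carries features from several ViT depths, which are concatenated and projected before the language model consumes them.

// Toy illustration of multi-level feature fusion (not real model code).
// Each visual token has one feature vector per sampled ViT depth.
type Vec = number[]

function fuseDeepStack(
  levels: Vec[], // e.g. [shallow, middle, deep] features for one token
  project: (v: Vec) => Vec, // stands in for a learned projection
): Vec {
  // Concatenate across depths, then project back to the model width.
  return project(levels.flat())
}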
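For the MRoPE question, the coordinate scheme can be illustrated by how video patches are indexed. The toy function below only shows the explicit (time, height, width) triplet per patch; real MRoPE distributes these axes across rotary frequency bands rather than storing triplets:

// Toy illustration: each video patch gets a 3D position instead of a
// single 1D sequence index. Not the actual embedding computation.
function videoPatchPositions(frames: number, rows: number, cols: number) {
  const positions: { t: number; h: number; w: number }[] = []
  for (let t = 0; t < frames; t++)
    for (let h = 0; h < rows; h++)
      for (let w = 0; w < cols; w++) positions.push({ t, h, w })
  return positions
}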
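To make the shared 262.1K budget concrete, here is a back-of-the-envelope helper. The per-image and per-frame token costs are placeholder assumptions, since actual counts depend on resolution and the vision tokenizer; only the 262,144-token total reflects the model's stated context:

// Rough budget check with ASSUMED visual token costs.
const CONTEXT_LIMIT = 262_144

function fitsInContext(opts: {
  textTokens: number
  images: number
  videoFrames: number
  tokensPerImage?: number // assumption, not a published figure
  tokensPerFrame?: number // assumption, not a published figure
}) {
  const {
    textTokens,
    images,
    videoFrames,
    tokensPerImage = 1_000,
    tokensPerFrame = 300,
  } = opts
  const total = textTokens + images * tokensPerImage + videoFrames * tokensPerFrame
  return { total, fits: total <= CONTEXT_LIMIT }
}

// e.g. ~27K text tokens, 10 images, 500 sampled video frames
console.log(fitsInContext({ textTokens: 27_000, images: 10, videoFrames: 500 }))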