Qwen 3 VL 235B A22B Instruct
Qwen 3 VL 235B A22B Instruct is Alibaba's 235B mixture-of-experts vision-language model with 22B active parameters per token. It accepts interleaved text, images, and video within a context window of 262.1K tokens and targets visual coding, spatial perception, and fine-grained visual understanding.
import { streamText } from 'ai'
const result = streamText({ model: 'alibaba/qwen3-vl-235b-a22b-instruct', prompt: 'Why is the sky blue?' })
Playground
Try out Qwen 3 VL 235B A22B Instruct by Alibaba. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
Per-provider metrics:
- Throughput: P50 on live AI Gateway traffic, in tokens per second (TPS).
- Time to first token (TTFT): P50 on live AI Gateway traffic, in milliseconds.
- Success rate: direct request success rate on AI Gateway and per provider.
Visit the docs for more info.
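As a minimal sketch of setting a provider preference from copied slugs, the helper below builds a `providerOptions` object. It assumes the gateway reads an ordered list under `providerOptions.gateway.order`, which should be checked against the AI Gateway docs. The slugs `provider-a` and `provider-b` and the helper name `preferProviders` are hypothetical placeholders, not real providers for this model.

```typescript
// Assumed shape of the gateway routing options; verify against the AI Gateway docs.
type GatewayProviderOptions = { gateway: { order: string[] } };

// Hypothetical helper: turns copied provider slugs into a priority-ordered preference.
function preferProviders(slugs: string[]): GatewayProviderOptions {
  const order = slugs
    .map((slug) => slug.trim())
    .filter((slug) => slug.length > 0)
    // Keep only the first occurrence so the list reflects priority order.
    .filter((slug, index, all) => all.indexOf(slug) === index);
  if (order.length === 0) {
    throw new Error('at least one provider slug is required');
  }
  return { gateway: { order } };
}

// Placeholder slugs; copy real ones from the Providers table.
const providerOptions = preferProviders(['provider-a', 'provider-b', 'provider-a']);
console.log(JSON.stringify(providerOptions));
```

The resulting object would be passed next to the model id in a call such as `streamText({ model, prompt, providerOptions })`.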
About Qwen 3 VL 235B A22B Instruct
Qwen 3 VL 235B A22B Instruct is the September 23, 2025 version of Alibaba's 235B-A22B vision-language model in instruct configuration. Built on a mixture-of-experts (MoE) architecture, it carries 235 billion total parameters with approximately 22 billion active per token, and serves a context window of 262.1K tokens for interleaved sequences of text, images, and video frames.
Compared with prior Qwen vision-language generations, the Qwen3 VL series brings improvements across visual coding, spatial perception, and fine-grained visual understanding. Qwen 3 VL 235B A22B Instruct parses charts, diagrams, GUI screenshots, and document images with stronger grounding, and can identify and reason about object positions and relationships within complex scenes.
The instruct configuration is tuned for direct instruction following rather than extended chain-of-thought. That makes Qwen 3 VL 235B A22B Instruct a practical default for production multimodal workloads that need fast, structured responses: document intelligence, screen-reading agents, multi-image analysis, and visual coding pipelines. You can integrate it through AI SDK, Chat Completions API, Responses API, Messages API, or other API formats, from TypeScript or Python. The maximum output is 262.1K tokens per request.
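For the AI SDK route, a multi-image request is typically expressed as one user message whose content mixes text and image parts. The sketch below builds such a message with locally declared types that mirror the AI SDK's text and image part shapes as I understand them. The helper name `buildImageComparison` and the example URLs are hypothetical.

```typescript
// Local types mirroring the AI SDK's text and image message parts (assumed shape).
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string }; // http(s) URL or base64 data URL
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Hypothetical helper: one question followed by the images it refers to.
function buildImageComparison(question: string, imageUrls: string[]): UserMessage {
  if (imageUrls.length === 0) {
    throw new Error('at least one image URL is required');
  }
  const images = imageUrls.map((url): ImagePart => {
    if (!url.startsWith('https://') && !url.startsWith('data:')) {
      throw new Error('unsupported image reference: ' + url);
    }
    return { type: 'image', image: url };
  });
  const content: Array<TextPart | ImagePart> = [{ type: 'text', text: question }, ...images];
  return { role: 'user', content };
}

// Placeholder URLs for illustration only.
const message = buildImageComparison('Which chart shows the steeper decline?', [
  'https://example.com/chart-a.png',
  'https://example.com/chart-b.png',
]);
console.log(message.content.length);
```

The message would then be sent with something like `streamText({ model: 'alibaba/qwen3-vl-235b-a22b-instruct', messages: [message] })`.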
What To Consider When Choosing a Provider
- Configuration: Multimodal payloads that combine large images or video frames consume meaningful context tokens. Profile your typical request shape against the context window of 262.1K tokens and confirm your provider's serving infrastructure handles your throughput target before routing production traffic.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
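To make the profiling advice under Configuration concrete, here is a rough budgeting sketch. The per-image and per-frame token costs are made-up placeholders, since actual costs depend on resolution and provider. The sketch also assumes reserved output tokens count against the same window, and it uses 262,100 as the rounded 262.1K figure from this page.

```typescript
// 262.1K as stated on this page; the exact limit may differ slightly.
const CONTEXT_WINDOW_TOKENS = 262100;

interface RequestShape {
  textTokens: number;
  imageCount: number;
  videoFrameCount: number;
  reservedOutputTokens: number;
}

// Hypothetical per-item costs; measure these on your own traffic.
interface VisualCosts {
  tokensPerImage: number;
  tokensPerVideoFrame: number;
}

function estimateBudget(shape: RequestShape, costs: VisualCosts) {
  const inputTokens =
    shape.textTokens +
    shape.imageCount * costs.tokensPerImage +
    shape.videoFrameCount * costs.tokensPerVideoFrame;
  // Assumption: output tokens share the window with input tokens.
  const totalTokens = inputTokens + shape.reservedOutputTokens;
  return {
    inputTokens,
    totalTokens,
    fits: totalTokens <= CONTEXT_WINDOW_TOKENS,
    headroom: CONTEXT_WINDOW_TOKENS - totalTokens,
  };
}

const budget = estimateBudget(
  { textTokens: 2000, imageCount: 8, videoFrameCount: 64, reservedOutputTokens: 4096 },
  { tokensPerImage: 1200, tokensPerVideoFrame: 300 },
);
console.log(budget);
```

Running an estimate like this over a sample of real requests gives an early signal of which request shapes approach the limit before production traffic is routed.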
When to Use Qwen 3 VL 235B A22B Instruct
Best For
- Visual Coding Pipelines: Turning screenshots, mockups, or diagrams into accurate component or function output
- Document Intelligence Tasks: Scanned pages, tables, and figures that need fine-grained visual perception
- Screen-Reading Agents: Interpreting GUI screenshots to plan and execute UI actions
- Multi-Image Comparative Analysis: Charts, product photos, or document figures reviewed side by side in a single request
- Unified Multimodal Context: Combined text, image, and video inputs handled within the window of 262.1K tokens
Consider Alternatives When
- Visible Reasoning Required: Qwen3-VL-Thinking is a closer match when tasks need step-by-step visual reasoning traces
- Text-Only Workloads: A dedicated text model offers lower cost per token when vision is never used
- Latency-Critical Basic Tasks: A smaller multimodal model can serve simple instruction following at lower cost
- Image Or Video Generation: A generation-class model fits tasks that produce pixels rather than read them
Conclusion
Qwen 3 VL 235B A22B Instruct is the pinned 235B-A22B instruct release in the Qwen3 vision-language line, suited to production multimodal workloads that need strong visual perception, spatial reasoning, and direct instruction following. Routing through AI Gateway gives you provider failover, unified billing, and a consistent integration surface across the Qwen3-VL family.