Llama 3.2 90B Vision Instruct
Llama 3.2 90B Vision Instruct is Meta's highest-capability vision-language model at the Llama 3.2 launch. It pairs large-scale language generation with image reasoning, a context window of 128K tokens, and support for complex multi-element visual analysis.
import { streamText } from 'ai'
const result = streamText({ model: 'meta/llama-3.2-90b', prompt: 'Why is the sky blue?' })
Playground
Try out Llama 3.2 90B Vision Instruct by Meta. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
Each provider is listed with the following metrics. Visit the docs for more info.
- Throughput: P50 throughput on live AI Gateway traffic, in tokens per second (TPS).
- Time to first token: P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds.
- Success rate: Direct request success rate on AI Gateway and per provider.
About Llama 3.2 90B Vision Instruct
Llama 3.2 90B Vision Instruct is Meta's largest open vision-language model, released on September 25, 2024. It's built on the Llama 3.1 70B language foundation, with a cross-attention vision adapter connecting the language model to a vision encoder. With a 70B language backbone inside its 90B total parameters, it has substantially more capacity for complex reasoning, synthesis, and generation than the 11B variant. Tasks that pair difficult visual understanding with demanding text generation are better served here.
The context window of 128K tokens accommodates extended visual conversations: a sequence of images with questions, a technical document with figures and tables, or an annotated slide deck can all fit in a single context. For enterprise use cases like research document analysis, technical diagram interpretation, and medical image description, the combination of large language model capacity and vision capability is more practical than at the 11B scale.
What To Consider When Choosing a Provider
- Configuration: The 90B parameter scale needs more compute per request than the 11B variant. Factor infrastructure cost into whether this scale is necessary for your visual reasoning tasks. Compare the listed rates ($0.72 and $0.72) when weighing providers.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
When to Use Llama 3.2 90B Vision Instruct
Best For
- Complex visual reasoning with rich language output: Tasks that need intricate image analysis plus long-form language generation (technical diagrams, medical image description, annotated documents) benefit from 90B-scale language capacity
- Multi-turn extended visual conversations: The context window of 128K tokens supports long exchanges involving multiple images and questions in a single coherent context, maintaining reference to earlier visual content through the conversation
- Dense document analysis: Research papers, engineering specifications, and legal documents with embedded figures, tables, and diagrams can be processed holistically at 90B scale
Consider Alternatives When
- Visual tasks are straightforward: Object recognition, basic visual QA, and simple scene description often do not require 90B-scale reasoning capacity, and Llama 3.2 11B handles these at substantially lower cost
- Text-only workloads: Llama 3.3 70B or 4.x models are better optimized for pure text tasks without the overhead of vision capability
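The guidance above can be sketched as a small routing helper. This is only an illustration under stated assumptions: the `Task` shape and `pickModel` function are hypothetical, and the 11B and 70B identifiers are guesses patterned on `meta/llama-3.2-90b`, not slugs confirmed by this page.

```typescript
// Hypothetical description of a request, used only for this sketch.
type Task = {
  hasImages: boolean;              // does the request include image input?
  complexVisualReasoning: boolean; // e.g. dense documents, technical diagrams
};

// ASSUMPTION: only 'meta/llama-3.2-90b' appears on this page. The other two
// identifiers are placeholders; check the model list for the real slugs.
const MODEL_90B_VISION = 'meta/llama-3.2-90b';
const MODEL_11B_VISION = 'meta/llama-3.2-11b'; // hypothetical slug
const MODEL_TEXT_ONLY = 'meta/llama-3.3-70b';  // hypothetical slug

// Pick the smallest model that plausibly covers the task.
function pickModel(task: Task): string {
  if (!task.hasImages) {
    return MODEL_TEXT_ONLY; // text-only workloads do not need vision capability
  }
  return task.complexVisualReasoning ? MODEL_90B_VISION : MODEL_11B_VISION;
}
```

In practice the `complexVisualReasoning` flag would come from your own heuristics or evaluation results; the helper only encodes the decision order described in the list.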
Conclusion
Llama 3.2 90B Vision Instruct is the most capable open-weight vision-language model in Meta's Llama 3.2 generation. It suits complex visual reasoning tasks that need detailed image analysis plus 90B-scale language generation. For tasks within the 11B reasoning envelope, the smaller variant is more economical.
Frequently Asked Questions
What makes Llama 3.2 90B Vision Instruct more capable than Llama 3.2 11B for vision tasks?
The 90B language model foundation has more capacity for complex reasoning, varied language generation, and difficult multi-element visual scenes. The gap shows most on tasks that combine intricate image understanding with long-form text generation, not simple visual QA.
How does the context window of 128K tokens work with image inputs?
Images are encoded as token sequences and consume context budget. A long conversation with multiple images and extended text history can fit in the 128K-token window, but high-resolution images encode to more tokens than low-resolution ones and use up the budget faster.
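Because images and text draw on the same budget, a rough pre-flight check can help. The sketch below is illustrative only: the function and type names are hypothetical, "128K" is read as 128,000 tokens, and the per-image token counts are caller-supplied estimates, since this page does not state how many tokens an image of a given resolution consumes.

```typescript
// ASSUMPTION: "128K" is treated as 128,000 tokens for this estimate.
const CONTEXT_WINDOW_TOKENS = 128000;

// Hypothetical shape for a caller's own token estimates.
interface ConversationEstimate {
  textTokens: number;        // estimated tokens in the text history
  imageTokens: number[];     // estimated tokens for each image (caller-supplied)
  reservedForOutput: number; // tokens to keep free for the model's reply
}

interface BudgetReport {
  used: number;      // tokens consumed by text, images, and the output reserve
  remaining: number; // tokens left in the window (negative if over budget)
  fits: boolean;     // true when the conversation fits in the window
}

// Sum the estimates and compare them against the context window.
function checkContextBudget(
  estimate: ConversationEstimate,
  windowTokens: number = CONTEXT_WINDOW_TOKENS,
): BudgetReport {
  const imageTotal = estimate.imageTokens.reduce((sum, t) => sum + t, 0);
  const used = estimate.textTokens + imageTotal + estimate.reservedForOutput;
  const remaining = windowTokens - used;
  return { used, remaining, fits: remaining >= 0 };
}

// Example with made-up estimates: two smaller images and one larger one.
const report = checkContextBudget({
  textTokens: 20000,
  imageTokens: [1600, 1600, 6400],
  reservedForOutput: 4000,
});
console.log(report); // { used: 33600, remaining: 94400, fits: true }
```

The numbers in the example are placeholders chosen for easy arithmetic, not measured token counts for this model.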
What types of documents and images does Llama 3.2 90B Vision Instruct handle best?
Technical documents with figures, research papers with charts and equations, medical imaging contexts, and complex multi-element scenes benefit most from the 90B-scale reasoning capacity. Standard photography and simple diagrams can typically be handled by the 11B variant.
When was Llama 3.2 90B Vision Instruct released?
Meta released Llama 3.2 90B Vision Instruct on September 25, 2024.
How do I use Llama 3.2 90B Vision Instruct on AI Gateway?
Use the identifier meta/llama-3.2-90b with any supported interface. You can send image inputs alongside text in the same request. Send the request through AI Gateway; it routes across providers and fails over automatically.
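As a rough sketch of sending an image alongside text, the helper below builds a single user message that mixes a text part with image parts. The local types and the `buildVisionMessages` name are hypothetical stand-ins; they are assumed to mirror the text and image content parts accepted by the AI SDK's `messages` option, and the image URL is a placeholder.

```typescript
// ASSUMPTION: local stand-ins for the message shape, written out here so the
// sketch is self-contained. Verify against the interface you actually use.
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string }; // an http(s) URL string
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Identifier taken from the answer above.
const MODEL_ID = 'meta/llama-3.2-90b';

// Build one user turn containing the question followed by each image.
function buildVisionMessages(question: string, imageUrls: string[]): UserMessage[] {
  const imageParts = imageUrls.map(
    (url): ImagePart => ({ type: 'image', image: url }),
  );
  return [
    {
      role: 'user',
      content: [{ type: 'text', text: question }, ...imageParts],
    },
  ];
}

// Placeholder URL for illustration only.
const messages = buildVisionMessages('What does this chart show?', [
  'https://example.com/chart.png',
]);
console.log(MODEL_ID, JSON.stringify(messages));
```

With the AI SDK, an array like this would be passed as `messages` in place of `prompt` in the `streamText` call, with `model` set to `MODEL_ID`.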