
Llama 3.2 90B Vision Instruct

Llama 3.2 90B Vision Instruct was Meta's highest-capability vision-language model at the Llama 3.2 launch. It pairs large-scale language generation with image reasoning, a context window of 128K tokens, and support for complex multi-element visual analysis.

Tool Use, Vision (Image)
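index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'meta/llama-3.2-90b',
  prompt: 'Why is the sky blue?',
})

// Print the streamed text as it arrives
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}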

Playground

Try out Llama 3.2 90B Vision Instruct by Meta. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
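
A provider preference can be expressed in code roughly as follows. This is a minimal sketch, not the documented method: it assumes the preference is passed as a `gateway.order` list inside the AI SDK's `providerOptions`, and that `bedrock` is the slug for Amazon Bedrock. Neither is stated on this page, so copy the real slug from the table and confirm the option shape in the docs. `preferProviders` is a hypothetical helper.

```typescript
// Hypothetical helper: builds the options fragment that expresses a
// provider preference. The `gateway.order` shape is an assumption.
type GatewayOptions = { gateway: { order: string[] } };

function preferProviders(...slugs: string[]): GatewayOptions {
  if (slugs.length === 0) {
    throw new Error('at least one provider slug is required');
  }
  // Drop duplicates while keeping the first occurrence of each slug.
  const order = slugs.filter((slug, i) => slugs.indexOf(slug) === i);
  return { gateway: { order } };
}

// 'bedrock' is an assumed slug for Amazon Bedrock.
const providerOptions = preferProviders('bedrock');
console.log(JSON.stringify(providerOptions));
```

The resulting object would be passed as `providerOptions` next to `model` and `prompt` in the `streamText` call.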

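| Provider | Context | Latency | Throughput | Input | Output | Cache | Web Search | Per Query | Capabilities | ZDR | No Training | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amazon Bedrock | 128K | 0.3s | 57tps | $0.72/M | $0.72/M | | — | — | | | | 09/25/2024 |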
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Meta

| Model | Context | Latency | Throughput | Input | Output | Cache | Web Search | Per Query | Capabilities | Providers | ZDR | No Training | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 131K | 0.2s | 51tps | $0.24/M | $0.97/M | | — | — | | bedrock, deepinfra | | | 04/05/2025 |
| | 131K | 0.2s | 169tps | $0.17/M | $0.66/M | | — | — | | bedrock, deepinfra, groq | | | 04/05/2025 |
| | 128K | 0.3s | 118tps | $0.59/M | $0.72/M | | — | — | | bedrock, groq | | | 12/06/2024 |
| | 128K | 0.2s | 53tps | $0.15/M | $0.15/M | | — | — | | bedrock | | | 09/18/2024 |
| | 131K | 0.1s | 150tps | $0.02/M | $0.05/M | Read: $0.03/M, Write: — | — | — | | bedrock, deepinfra, groq, +1 | | | 07/23/2024 |
| | 131K | 0.3s | 32tps | $0.72/M | $0.72/M | | — | — | | bedrock, deepinfra | | | 07/23/2024 |

About Llama 3.2 90B Vision Instruct

Llama 3.2 90B Vision Instruct is Meta's largest open vision-language model in the Llama 3.2 family, released on September 25, 2024. It's built on the Llama 3.1 70B language foundation, with a cross-attention vision adapter connecting the language model to a vision encoder. At 90B total parameters, it has substantially more capacity for complex reasoning, synthesis, and generation than the 11B variant, so tasks that pair difficult visual understanding with demanding text generation are better served here.

The context window of 128K tokens accommodates extended visual conversations: a sequence of images with questions, a technical document with figures and tables, or an annotated slide deck can all fit in a single context. For enterprise use cases like research document analysis, technical diagram interpretation, and medical image description, the combination of large language model capacity and vision capability is more practical than at the 11B scale.
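
Because images are encoded as tokens and share the window with text, it can help to budget a conversation before sending it. The sketch below is illustrative only: it reads "128K" as 128,000 tokens, assumes generated output counts against the same window, and uses made-up per-image token counts, since the actual count depends on image resolution and is not given here. `remainingContext` is a hypothetical helper.

```typescript
// "128K" read as 128,000 tokens; this exact figure is an assumption.
const CONTEXT_WINDOW_TOKENS = 128_000;

interface BudgetInput {
  imageTokens: number[];     // estimated tokens per image (hypothetical values)
  textTokens: number;        // prompt plus conversation history
  reservedForOutput: number; // room kept free for the model's reply
}

// Returns the tokens left in the window; a negative result means over budget.
function remainingContext(b: BudgetInput): number {
  const images = b.imageTokens.reduce((sum, t) => sum + t, 0);
  return CONTEXT_WINDOW_TOKENS - images - b.textTokens - b.reservedForOutput;
}

const left = remainingContext({
  imageTokens: [6_000, 6_000, 1_500],
  textTokens: 20_000,
  reservedForOutput: 4_000,
});
console.log(left >= 0 ? `fits, ${left} tokens to spare` : `over budget by ${-left} tokens`);
```

A negative result suggests dropping older images from the history or downscaling them before the request is sent.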

What To Consider When Choosing a Provider

  • Configuration: The 90B parameter scale needs more compute per request than the 11B variant. Factor infrastructure cost into whether this scale is necessary for your visual reasoning tasks. Amazon Bedrock, the only provider listed, charges $0.72/M for input tokens and $0.72/M for output tokens.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
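
To make the cost consideration above concrete, the listed rate of $0.72 per million tokens for both input and output can be turned into a rough spend estimate. The workload figures below are hypothetical, and `estimateCostUSD` is an illustrative helper, not part of any SDK.

```typescript
// Per-million-token prices listed for Amazon Bedrock on this page.
const INPUT_USD_PER_M = 0.72;
const OUTPUT_USD_PER_M = 0.72;

// Rough cost in USD for a given number of input and output tokens.
function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  if (inputTokens < 0 || outputTokens < 0) {
    throw new RangeError('token counts must be non-negative');
  }
  return (
    (inputTokens / 1_000_000) * INPUT_USD_PER_M +
    (outputTokens / 1_000_000) * OUTPUT_USD_PER_M
  );
}

// Hypothetical workload: 10,000 requests, each with about
// 4,000 input tokens and 500 output tokens.
const workloadCost = estimateCostUSD(10_000 * 4_000, 10_000 * 500);
console.log(`$${workloadCost.toFixed(2)}`);
```

Running the same estimate with the prices of a smaller model gives a quick way to judge whether the 90B scale is worth the difference for a given workload.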

When to Use Llama 3.2 90B Vision Instruct

Best For

  • Complex visual reasoning with rich language output: Tasks that need intricate image analysis plus long-form language generation (technical diagrams, medical image description, annotated documents) benefit from 90B-scale language capacity
  • Multi-turn extended visual conversations: The context window of 128K tokens supports long exchanges involving multiple images and questions in a single coherent context, maintaining reference to earlier visual content through the conversation
  • Dense document analysis: Research papers, engineering specifications, and legal documents with embedded figures, tables, and diagrams can be processed holistically at 90B scale

Consider Alternatives When

  • Visual tasks are straightforward: Object recognition, basic visual QA, and simple scene description often do not require 90B-scale reasoning capacity, and Llama 3.2 11B handles these at substantially lower cost
  • Text-only workloads: Llama 3.3 70B or 4.x models are better optimized for pure text tasks without the overhead of vision capability

Conclusion

Llama 3.2 90B Vision Instruct is the most capable open-weight vision-language model in the Llama 3.2 generation. It suits complex visual reasoning tasks that need detailed image analysis plus 90B-scale language generation. For tasks within the 11B reasoning envelope, the smaller variant is more economical.

Frequently Asked Questions

  • What makes Llama 3.2 90B Vision Instruct more capable than Llama 3.2 11B for vision tasks?

    The 90B language model foundation has more capacity for complex reasoning, varied language generation, and difficult multi-element visual scenes. The gap shows most on tasks that combine intricate image understanding with long-form text generation, not simple visual QA.

  • How does the context window of 128K tokens work with image inputs?

    Images are encoded as token sequences and consume context budget. A long conversation with multiple images and extended text history can be held in the window of 128K tokens, though very high-resolution images encode to more tokens than low-resolution ones.

  • What types of documents and images does Llama 3.2 90B Vision Instruct handle best?

    Technical documents with figures, research papers with charts and equations, medical imaging contexts, and complex multi-element scenes benefit most from the 90B-scale reasoning capacity. Standard photography and simple diagrams can typically be handled by the 11B variant.

  • When was Llama 3.2 90B Vision Instruct released?

    Meta released Llama 3.2 90B Vision Instruct on September 25, 2024.

  • How do I use Llama 3.2 90B Vision Instruct on AI Gateway?

    Use the identifier meta/llama-3.2-90b with any supported interface. You can send image inputs alongside text in the same request. Send the request through AI Gateway; it routes requests across the available providers and fails over automatically.
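
Sending an image alongside text might look like the sketch below, which only builds the message payload. It assumes the AI SDK's user-message format with `text` and `image` content parts, and that an `image` part accepts an http(s) URL string. The local types mirror that format instead of importing it, `buildVisionMessage` is a hypothetical helper, and the URL is a placeholder.

```typescript
// Local types mirroring the text and image content parts of an
// AI SDK user message (assumed shape, declared here to stay self-contained).
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Builds one user message: the question first, then one part per image.
function buildVisionMessage(question: string, imageUrls: string[]): UserMessage {
  const images = imageUrls.map((url): ImagePart => ({ type: 'image', image: url }));
  const text: TextPart = { type: 'text', text: question };
  return { role: 'user', content: [text, ...images] };
}

// Placeholder URL for illustration only.
const message = buildVisionMessage('What does this diagram show?', [
  'https://example.com/diagram.png',
]);
console.log(JSON.stringify(message, null, 2));
```

The message would then be passed as `messages: [message]` together with `model: 'meta/llama-3.2-90b'` in place of the `prompt` field shown in the snippet at the top of the page.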