Llama 3.2 90B Vision Instruct
Llama 3.2 90B Vision Instruct is Meta's highest-capability vision-language model from the Llama 3.2 launch. It pairs large-scale language generation with image reasoning, a 128K-token context window, and support for complex multi-element visual analysis.
```ts
import { streamText } from 'ai'

const result = streamText({
  model: 'meta/llama-3.2-90b',
  prompt: 'Why is the sky blue?',
})
```

Frequently Asked Questions
What makes Llama 3.2 90B Vision Instruct more capable than Llama 3.2 11B for vision tasks?
The 90B language-model foundation has more capacity for complex reasoning, varied language generation, and handling difficult multi-element visual scenes. The gap shows most on tasks that combine intricate image understanding with long-form text generation; it is smaller on simple visual QA.
How does the 128K-token context window work with image inputs?
Images are encoded as token sequences and consume context budget just like text. A long conversation with multiple images and an extended text history can be held within the 128K-token window, though very high-resolution images encode to more tokens than low-resolution ones.
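As a rough sketch using the AI SDK, a multi-image conversation is just a messages array; every image part is encoded into tokens that count against the same 128K budget as the surrounding text turns. The URLs and the assistant reply below are placeholders:

```ts
import { generateText } from 'ai'

const result = await generateText({
  model: 'meta/llama-3.2-90b',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What does this figure show?' },
        // Placeholder URL, for illustration only.
        { type: 'image', image: new URL('https://example.com/figure-1.png') },
      ],
    },
    {
      // Placeholder assistant turn; its text also occupies context tokens.
      role: 'assistant',
      content: 'The figure shows quarterly revenue by region.',
    },
    {
      role: 'user',
      content: [
        { type: 'text', text: 'How does this second figure differ?' },
        // A second image adds its own encoded tokens to the running context.
        { type: 'image', image: new URL('https://example.com/figure-2.png') },
      ],
    },
  ],
})

console.log(result.text)
```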
What types of documents and images does Llama 3.2 90B Vision Instruct handle best?
Technical documents with figures, research papers with charts and equations, medical imaging contexts, and complex multi-element scenes benefit most from the 90B-scale reasoning capacity. Standard photography and simple diagrams can typically be handled by the 11B variant.
When was Llama 3.2 90B Vision Instruct released?
Meta released Llama 3.2 90B Vision Instruct on September 25, 2024.
How do I use Llama 3.2 90B Vision Instruct on AI Gateway?
Use the identifier `meta/llama-3.2-90b` with any supported interface. You can send image inputs alongside text in the same request. Send the request through AI Gateway; it routes requests across providers and fails over automatically.
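A minimal sketch of a single vision request: the `meta/llama-3.2-90b` identifier is the one documented above, while the image URL is a placeholder.

```ts
import { streamText } from 'ai'

const result = streamText({
  model: 'meta/llama-3.2-90b',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Summarize the chart in this image.' },
        // Placeholder URL; the AI SDK also accepts binary image data here.
        { type: 'image', image: new URL('https://example.com/chart.png') },
      ],
    },
  ],
})

// Stream the model's answer as it is generated.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}
```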