
Llama 4 Scout 17B 16E Instruct

Llama 4 Scout 17B 16E Instruct is a natively multimodal Mixture of Experts (MoE) model purpose-built for processing entire codebases, multi-document corpora, and extended user activity logs in a single inference call. The model natively supports a context window of up to 10 million tokens; the providers listed below serve up to 131K tokens of it.

Tool Use · Vision (Image)
index.ts

import { streamText } from 'ai'

const result = streamText({
  model: 'meta/llama-4-scout',
  prompt: 'Why is the sky blue?',
})

// Stream the response to stdout as tokens arrive.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}

Playground

Try out Llama 4 Scout 17B 16E Instruct by Meta. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Llama 4 Scout 17B 16E Instruct

Meta released Llama 4 Scout 17B 16E Instruct on April 5, 2025 alongside Llama 4 Maverick as one of the founding models of the Llama 4 generation. Llama 4 Scout 17B 16E Instruct is a 17-billion-active-parameter Mixture of Experts model with 16 experts and 109 billion total parameters. It's substantially leaner in total parameter count than Maverick's 400B. Like Maverick, Llama 4 Scout 17B 16E Instruct was built with native multimodality across text, image, and video frame data.

Llama 4 Scout 17B 16E Instruct's defining characteristic is its context length. Meta extended the context window from 128K tokens in Llama 3 to 10 million tokens in Llama 4 Scout, about a 78x increase (the providers listed below currently serve up to 131K of that window). The architecture enabling this is iRoPE (interleaved Rotary Position Embeddings): most layers use standard RoPE, but the model interleaves attention layers without positional embeddings. Inference-time temperature scaling of attention further improves length generalization. Meta validated Llama 4 Scout 17B 16E Instruct with needle-in-a-haystack retrieval tests and cumulative negative log-likelihood evaluations over 10 million tokens of code.
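The sketch below illustrates the interleaving idea in TypeScript. It is a conceptual sketch only: the layer count, the one-in-four NoPE interval, and the temperature formula are assumptions for illustration, not Meta's published implementation.

// Conceptual sketch of an iRoPE-style layer stack (all constants are
// illustrative assumptions, not Meta's actual configuration).
type AttentionLayer = { usesRoPE: boolean }

const NUM_LAYERS = 48 // hypothetical depth
const layers: AttentionLayer[] = Array.from({ length: NUM_LAYERS }, (_, i) => ({
  // Most layers apply RoPE; every fourth layer drops positional embeddings
  // entirely, which is what helps the model generalize past its training length.
  usesRoPE: (i + 1) % 4 !== 0,
}))

// Inference-time attention temperature scaling: gently sharpen attention
// as query positions exceed the training context (formula is illustrative).
function attentionScale(position: number, trainedContext = 256_000): number {
  return 1 + 0.1 * Math.log(Math.max(1, position / trainedContext))
}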

A 10-million-token context can hold roughly 7.5 million words of plain text: the equivalent of approximately 25 full-length novels, or a large enterprise codebase with all source files, documentation, and test suites loaded together. Use cases include multi-document summarization across a large corpus, parsing extensive user activity logs, and reasoning over entire codebases in a single prompt without chunking or retrieval-augmented generation (RAG). This last capability is particularly notable for software development tooling, where RAG-based approaches introduce retrieval errors and context fragmentation.
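As a rough sketch of this no-RAG pattern with the AI SDK (the glob pattern, prompt wording, and use of generateText are illustrative assumptions, and the corpus must actually fit the serving context):

import { readFile } from 'node:fs/promises'
import { glob } from 'glob' // assumes the npm 'glob' package (v10+) is installed
import { generateText } from 'ai'

// Concatenate an entire codebase into one prompt: no chunking, no retrieval.
const files = await glob('src/**/*.ts')
const corpus = await Promise.all(
  files.map(async (path) => `// FILE: ${path}\n${await readFile(path, 'utf8')}`),
)

const { text } = await generateText({
  model: 'meta/llama-4-scout',
  prompt: `Review this codebase and suggest cross-file refactorings:\n\n${corpus.join('\n\n')}`,
})
console.log(text)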

Meta reports that Llama 4 Scout 17B 16E Instruct outperforms comparable models such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral Small 3.1 across a broad range of widely reported benchmarks. It also supports image grounding (aligning user prompts with specific visual regions) and exceeds prior Llama models on coding, reasoning, long-context, and image benchmarks.
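A minimal sketch of a vision request through the AI SDK (the image URL and question are placeholders; the message-parts shape follows the AI SDK's multimodal message format):

import { generateText } from 'ai'

// Ask a grounding-style question about a specific region of an image.
const { text } = await generateText({
  model: 'meta/llama-4-scout',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Which part of this diagram handles authentication?' },
        { type: 'image', image: new URL('https://example.com/architecture.png') },
      ],
    },
  ],
})
console.log(text)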

Providers

Route requests across multiple providers. Copy a provider slug to set your preference; a sketch of provider ordering follows the table below. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

Provider       | Context | Latency | Throughput | Input   | Output  | Release Date
DeepInfra      | 131K    | 0.3s    | 50 tps     | $0.08/M | $0.30/M | 04/05/2025
Groq           | 131K    | 0.1s    | n/a        | $0.11/M | $0.34/M | 04/05/2025
Amazon Bedrock | 128K    | 0.2s    | 200 tps    | $0.17/M | $0.66/M | 04/05/2025
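As referenced above, here is a sketch of pinning provider order per request (the gateway.order provider option and the exact slugs are assumptions; copy the real slugs from the table and confirm the option shape in the AI Gateway docs):

import { streamText } from 'ai'

// Prefer Groq for low TTFT, then fall back to Amazon Bedrock.
const result = streamText({
  model: 'meta/llama-4-scout',
  prompt: 'Summarize the incident log below...',
  providerOptions: {
    gateway: { order: ['groq', 'bedrock'] }, // slugs are illustrative
  },
})
for await (const part of result.textStream) process.stdout.write(part)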
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, shown above in seconds. Visit the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Meta

[Table: additional Meta models with context window, latency, throughput, and input/output pricing across providers including Amazon Bedrock, Cerebras, DeepInfra, and Groq; release dates range from 07/23/2024 to 04/05/2025.]

What To Consider When Choosing a Provider

  • Configuration: Scout's long-context capabilities introduce pricing considerations: longer prompts raise per-request costs substantially. Weigh each provider's input and output rates (e.g. Bedrock's $0.17/M input and $0.66/M output) against your expected context length; a back-of-envelope sketch follows this list.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
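As referenced in the configuration note above, a back-of-envelope cost estimate using Bedrock's rates from the provider table (the token counts are hypothetical):

// Estimate per-request cost from published per-million-token rates.
const INPUT_PER_M = 0.17 // Amazon Bedrock input, $ per 1M tokens
const OUTPUT_PER_M = 0.66 // Amazon Bedrock output, $ per 1M tokens

function requestCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * INPUT_PER_M + (outputTokens / 1e6) * OUTPUT_PER_M
}

// A full 128K-token prompt with a 2K-token response:
console.log(requestCostUSD(128_000, 2_000).toFixed(4)) // ≈ 0.0231, about 2.3 cents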

When to Use Llama 4 Scout 17B 16E Instruct

Best For

  • Entire codebase processing: Architecture review, cross-file refactoring suggestions, and comprehensive code search in a single inference call
  • Multi-document analysis: Legal discovery across large contract sets or literature review across a research corpus where chunking loses coherence
  • Long-session personalization: Parsing extensive user history or activity logs without summarization loss
  • Image grounding applications: Precise visual localization across multi-image inputs

Consider Alternatives When

  • Standard context is sufficient: if maximum multimodal capability within a standard context window matters more than raw context length, Maverick's 128-expert architecture offers greater image and text depth
  • General assistant workload: Maverick is Meta's designated product workhorse
  • Modest context tasks: A smaller, cheaper model such as Llama 3.3 70B would satisfy quality requirements
  • Cost concerns at scale: inputs that approach the full context window incur substantially higher per-request costs than typical short-context usage

Conclusion

Llama 4 Scout 17B 16E Instruct extends what open-weight models can handle for long-context applications. The combination of a 10-million-token context window and native multimodality suits it to codebase-scale reasoning, multi-document analysis, and long-session personalization tasks that were previously impractical without chunking or retrieval augmentation. Its iRoPE architecture makes it the long-context specialist within the Llama 4 generation.

Frequently Asked Questions

  • How large is the 10-million-token context window in practical terms?

    Approximately 7.5 million words. That's roughly 25 full-length novels, a multi-year document archive, or a large enterprise codebase with source files, tests, and documentation all loaded simultaneously.

  • What is the iRoPE architecture and why does it matter for long context?

    iRoPE stands for interleaved Rotary Position Embeddings. Most attention layers use standard RoPE, but some layers use no positional embeddings. Inference-time temperature scaling of attention further enhances length generalization. This combination lets the model generalize beyond its training context length.

  • How does Llama 4 Scout 17B 16E Instruct handle multi-image inputs?

    Llama 4 Scout 17B 16E Instruct supports up to eight images per request. It also supports image grounding, aligning natural language prompts with specific regions or objects in images.

  • Is Llama 4 Scout 17B 16E Instruct suited for RAG, or does the 10M-token context replace it?

    For applications where the full corpus fits within the serving context, loading everything into context can be more accurate than retrieval augmentation because it avoids retrieval errors and fragmentation. For larger corpora, RAG remains appropriate, but Llama 4 Scout 17B 16E Instruct can handle much larger retrieval chunks or many retrieved documents simultaneously.

  • How does Llama 4 Scout 17B 16E Instruct differ from Maverick? They have the same active parameter count.

    Both have 17B active parameters but differ in expert count and total parameters. Llama 4 Scout 17B 16E Instruct has 16 experts and 109B total; Maverick has 128 experts and 400B total. Maverick stores more knowledge in its larger parameter budget. Llama 4 Scout 17B 16E Instruct is leaner but specialized for extreme context length. Meta designates Maverick as the general-purpose product model and Llama 4 Scout 17B 16E Instruct as the long-context specialist.

  • What languages does Llama 4 Scout 17B 16E Instruct support?

    Like all Llama 4 models, Llama 4 Scout 17B 16E Instruct was pretrained on 200 languages, over 100 of which have more than 1 billion training tokens each: 10x more multilingual tokens than Llama 3.