
Llama 4 Scout 17B 16E Instruct

Llama 4 Scout 17B 16E Instruct is a natively multimodal Mixture of Experts (MoE) model with a context window of 131.1K tokens, purpose-built for processing entire codebases, multi-document corpora, and extended user activity logs in a single inference call.

Tool Use, Vision (Image)
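index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'meta/llama-4-scout',
  prompt: 'Why is the sky blue?',
})

// Consume the stream; without this loop nothing is printed
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}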

Playground

Try out Llama 4 Scout 17B 16E Instruct by Meta. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.


Ask Llama 4 Scout 17B 16E Instruct anything to try it out.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Context | Latency | Throughput | Input | Output | Cache | Web Search | Per Query | Release Date |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra | 131K | 0.2s | 56tps | $0.10/M | $0.30/M | | - | - | 04/05/2025 |
| Groq | 131K | 0.6s | | $0.11/M | $0.34/M | | - | - | 04/05/2025 |
| Amazon Bedrock | 128K | 0.2s | 182tps | $0.17/M | $0.66/M | | - | - | 04/05/2025 |

Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.
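
The page does not say how these statistics are computed. As a rough illustration only, the sketch below uses the nearest-rank definition of a percentile and a plain success fraction; the function names and the sample values are hypothetical and are not real gateway measurements.

```typescript
/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError('need at least one sample');
  if (p < 0 || p > 100) throw new RangeError('p must be between 0 and 100');
  const sorted = [...samples].sort((a, b) => a - b);
  // Clamp to rank 1 so that p = 0 returns the smallest sample.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

/** Fraction of requests that succeeded, in the range 0..1. */
function successRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) throw new RangeError('need at least one outcome');
  return outcomes.filter(Boolean).length / outcomes.length;
}

// Hypothetical time-to-first-token samples in seconds.
const ttftSeconds = [0.9, 0.2, 0.1, 0.6, 0.2];
console.log(percentile(ttftSeconds, 50)); // 0.2
console.log(successRate([true, true, false, true])); // 0.75
```

A P50 is less sensitive to a few slow requests than a mean, which is presumably why it is used for the headline latency and throughput figures.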

More models by Meta

| Model | Context | Latency | Throughput | Input | Output | Cache | Web Search | Per Query | Providers | Release Date |
|---|---|---|---|---|---|---|---|---|---|---|
| | 131K | 0.2s | 163tps | $0.24/M | $0.97/M | | - | - | bedrock, deepinfra | 04/05/2025 |
| | 128K | 0.2s | 157tps | $0.59/M | $0.72/M | | - | - | bedrock, groq | 12/06/2024 |
| | 128K | 0.2s | 181tps | $0.16/M | $0.16/M | | - | - | bedrock | 09/25/2024 |
| | 128K | 0.2s | 52tps | $0.15/M | $0.15/M | | - | - | bedrock | 09/18/2024 |
| | 131K | 0.1s | 28tps | $0.02/M | $0.05/M | Read: $0.03/M, Write: - | - | - | bedrock, deepinfra, groq, +1 | 07/23/2024 |
| | 131K | 0.2s | 35tps | $0.72/M | $0.72/M | | - | - | bedrock, deepinfra | 07/23/2024 |

About Llama 4 Scout 17B 16E Instruct

Meta released Llama 4 Scout 17B 16E Instruct on April 5, 2025 alongside Llama 4 Maverick as one of the founding models of the Llama 4 generation. Llama 4 Scout 17B 16E Instruct is a 17-billion-active-parameter Mixture of Experts model with 16 experts and 109 billion total parameters. It's substantially leaner in total parameter count than Maverick's 400B. Like Maverick, Llama 4 Scout 17B 16E Instruct was built with native multimodality across text, image, and video frame data.

Llama 4 Scout 17B 16E Instruct's defining characteristic is its context length. Meta extended the supported context from 128K tokens in Llama 3 to 10 million tokens in Llama 4 Scout 17B 16E Instruct, about a 78x increase; the providers listed on this page expose up to 131.1K tokens of that window. The architecture enabling this is iRoPE (interleaved Rotary Position Embeddings): most layers use standard RoPE, while interleaved attention layers use no positional embeddings at all. Inference-time temperature scaling of attention further improves length generalization. Meta validated the model with needle-in-a-haystack retrieval tests and cumulative negative log-likelihood evaluations over 10 million tokens of code.
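
To make the interleaving idea concrete, here is a minimal sketch of a per-layer schedule in which every N-th attention layer skips positional embeddings and the rest use RoPE. The function name, the `nopeInterval` parameter, and the example values (8 layers, interval 4) are illustrative assumptions; they are not taken from Scout's actual configuration.

```typescript
type PositionalScheme = 'rope' | 'nope';

/**
 * Builds an illustrative layer schedule: every `nopeInterval`-th layer
 * uses no positional embedding ('nope'); all other layers use RoPE.
 * Both parameters are hypothetical knobs for this sketch.
 */
function layerSchedule(numLayers: number, nopeInterval: number): PositionalScheme[] {
  if (!Number.isInteger(numLayers) || numLayers < 0) {
    throw new RangeError('numLayers must be a non-negative integer');
  }
  if (!Number.isInteger(nopeInterval) || nopeInterval < 2) {
    throw new RangeError('nopeInterval must be an integer >= 2');
  }
  return Array.from({ length: numLayers }, (_, i) =>
    (i + 1) % nopeInterval === 0 ? 'nope' : 'rope',
  );
}

console.log(layerSchedule(8, 4).join(' '));
// rope rope rope nope rope rope rope nope
```

The sketch covers only the layer layout; the inference-time attention temperature scaling is a separate mechanism and is not modeled here.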

A context of 10 million tokens can hold roughly 7.5 million words of plain text, the equivalent of approximately 25 full-length novels, or a large enterprise codebase with all source files, documentation, and test suites loaded together. The 131.1K-token window available through the providers on this page is much smaller, but it still fits a sizeable repository or document set in a single prompt. Use cases include multi-document summarization across a large corpus, parsing extensive user activity logs, and reasoning over entire codebases in a single prompt without chunking or retrieval-augmented generation (RAG). This last capability is particularly notable for software development tooling, where RAG-based approaches can introduce retrieval errors and context fragmentation.
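
As a sanity check on these word counts, the sketch below assumes about 0.75 English words per token. That ratio is a rule of thumb, not a tokenizer measurement: it is the ratio that pairs 7.5 million words with Meta's native 10-million-token window, and it puts a 131.1K-token window at roughly 98,000 words. All function names are hypothetical.

```typescript
// Assumed rule of thumb; real tokenizers vary by language and content type.
const WORDS_PER_TOKEN = 0.75;

/** Approximate number of plain-text words that fit in a context window. */
function wordsThatFit(contextTokens: number): number {
  return Math.floor(contextTokens * WORDS_PER_TOKEN);
}

/** Approximate token count for a text, from its whitespace-separated word count. */
function estimateTokensFromWords(text: string): number {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return Math.ceil(words / WORDS_PER_TOKEN);
}

/** True when the estimated prompt plus a reserved output budget fits in the window. */
function fitsInWindow(text: string, contextTokens: number, reservedOutputTokens = 0): boolean {
  return estimateTokensFromWords(text) + reservedOutputTokens <= contextTokens;
}

console.log(wordsThatFit(131_100)); // 98325
console.log(wordsThatFit(10_000_000)); // 7500000
```

For anything billing-sensitive, count tokens with the provider's actual tokenizer rather than a word-based estimate.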

Meta reports that Llama 4 Scout 17B 16E Instruct delivers better results than Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of widely reported benchmarks. It also supports image grounding (aligning user prompts with specific visual regions) and exceeds prior Llama models on coding, reasoning, long context, and image benchmarks.

What To Consider When Choosing a Provider

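  • Configuration: Scout's long-context capability has a pricing consequence: cost scales with prompt length, so long prompts raise per-request costs substantially. Among the providers listed above, input prices range from $0.10 to $0.17 per million tokens and output prices from $0.30 to $0.66 per million tokens; compare these against your expected context length.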
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
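
To compare providers for a given request size, a back-of-the-envelope estimate is usually enough. The sketch below uses the per-million-token prices from the Providers table above. The context limits are rounded from the table's "131K" and "128K" labels, and treating input plus output as sharing one window is an assumption; type and function names are hypothetical.

```typescript
interface ProviderPrice {
  provider: string;
  contextTokens: number; // rounded from the table's 131K / 128K labels (assumption)
  inputPerM: number; // USD per million input tokens
  outputPerM: number; // USD per million output tokens
}

const PROVIDERS: ProviderPrice[] = [
  { provider: 'DeepInfra', contextTokens: 131_000, inputPerM: 0.10, outputPerM: 0.30 },
  { provider: 'Groq', contextTokens: 131_000, inputPerM: 0.11, outputPerM: 0.34 },
  { provider: 'Amazon Bedrock', contextTokens: 128_000, inputPerM: 0.17, outputPerM: 0.66 },
];

/** Estimated cost in USD of one request at the given token counts. */
function estimateCostUsd(p: ProviderPrice, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * p.inputPerM + (outputTokens / 1e6) * p.outputPerM;
}

/** Providers whose window fits the request, cheapest first. */
function rankByCost(inputTokens: number, outputTokens: number, providers: ProviderPrice[] = PROVIDERS) {
  return providers
    .filter((p) => inputTokens + outputTokens <= p.contextTokens)
    .map((p) => ({ provider: p.provider, costUsd: estimateCostUsd(p, inputTokens, outputTokens) }))
    .sort((a, b) => a.costUsd - b.costUsd);
}

// A 100K-token prompt with a 1K-token answer.
for (const { provider, costUsd } of rankByCost(100_000, 1_000)) {
  console.log(`${provider}: $${costUsd.toFixed(4)}`);
}
```

Price is only one axis; the latency and throughput columns in the same table may matter more for interactive workloads.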

When to Use Llama 4 Scout 17B 16E Instruct

Best For

  • Entire codebase processing: Architecture review, cross-file refactoring suggestions, and comprehensive code search in a single inference call
  • Multi-document analysis: Legal discovery across large contract sets or literature review across a research corpus where chunking loses coherence
  • Long-session personalization: Parsing extensive user history or activity logs without summarization loss
  • Image grounding applications: Precise visual localization across multi-image inputs
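
For the codebase case, the practical step is packing as many files as fit into one prompt. The following is a minimal sketch under stated assumptions: roughly 4 characters per token (a crude heuristic, not the model's tokenizer), a greedy first-fit order, and hypothetical names (`SourceFile`, `packFiles`).

```typescript
interface SourceFile {
  path: string;
  content: string;
}

// Crude heuristic for illustration; use the real tokenizer for exact counts.
const CHARS_PER_TOKEN = 4;
const approxTokens = (text: string): number => Math.ceil(text.length / CHARS_PER_TOKEN);

/** Greedily adds files in the given order, skipping any that would exceed the budget. */
function packFiles(files: SourceFile[], budgetTokens: number) {
  const parts: string[] = [];
  const included: string[] = [];
  const skipped: string[] = [];
  let usedTokens = 0;
  for (const file of files) {
    // A path header lets the model cite which file a passage came from.
    const section = `=== ${file.path} ===\n${file.content}\n`;
    const cost = approxTokens(section);
    if (usedTokens + cost > budgetTokens) {
      skipped.push(file.path);
      continue;
    }
    parts.push(section);
    included.push(file.path);
    usedTokens += cost;
  }
  return { prompt: parts.join(''), included, skipped, usedTokens };
}

const packed = packFiles(
  [
    { path: 'a.ts', content: 'x'.repeat(40) },
    { path: 'b.ts', content: 'y'.repeat(400) },
    { path: 'c.ts', content: 'z'.repeat(8) },
  ],
  30,
);
console.log(packed.included, packed.skipped); // [ 'a.ts', 'c.ts' ] [ 'b.ts' ]
```

In practice, leave headroom in the budget for the instructions and for the model's output, and order files so the most relevant ones are packed first.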

Consider Alternatives When

  • Standard context sufficient: If maximum multimodal capability matters more than window length, Maverick's 128-expert architecture offers greater image and text depth
  • General assistant workload: Maverick is Meta's designated product workhorse
  • Modest context tasks: A smaller, cheaper model such as Llama 3.3 70B would satisfy quality requirements
  • Cost concerns at scale: 131.1K-token inputs result in substantially higher per-request costs than typical short-context usage

Conclusion

Llama 4 Scout 17B 16E Instruct extends what open-weight models can handle for long-context applications. The combination of a window of 131.1K tokens and native multimodality suits it for codebase-scale reasoning, multi-document analysis, and long-session personalization tasks that were previously impractical without chunking or retrieval augmentation. Its iRoPE architecture makes it the long-context specialist within the Llama 4 generation.