Llama 4 Maverick 17B 128E Instruct FP8
Llama 4 Maverick 17B 128E Instruct FP8 is Meta's natively multimodal Mixture of Experts (MoE) model with 17B active parameters across 128 experts. Published benchmarks span image and text tasks, and the MoE activates a fraction of the parameters that comparable dense models use.
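import { streamText } from 'ai'
const result = streamText({ model: 'meta/llama-4-maverick', prompt: 'Why is the sky blue?' })
// Print the response text as it streams in
for await (const chunk of result.textStream) process.stdout.write(chunk)
Playground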
Try out Llama 4 Maverick 17B 128E Instruct FP8 by Meta. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
About Llama 4 Maverick 17B 128E Instruct FP8
Meta released Llama 4 Maverick 17B 128E Instruct FP8 on April 5, 2025 as one of the first two models in the Llama 4 generation. The collection is built around two architectural advances: native multimodality through early fusion, and Mixture of Experts (MoE). Llama 4 Maverick 17B 128E Instruct FP8 is the larger and more capable of the two initial releases, with 17 billion active parameters, 128 routed experts plus one shared expert, and 400 billion total parameters. Each token activates only 17B of those 400B parameters, because every MoE layer sends the token to the shared expert plus a single routed expert. This makes inference substantially cheaper in compute than a dense 400B model while preserving the quality benefits of the larger total parameter budget.
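To make the active-versus-total figures concrete, the short sketch below works out the fraction of parameters used per token. The two constants come from the paragraph above; the function and variable names are illustrative only. Note that this ratio describes per-token compute, not memory: all experts still have to be loaded to serve the model.

```typescript
// Figures taken from the description above, in billions of parameters.
const TOTAL_PARAMS_B = 400;  // total parameters across all experts
const ACTIVE_PARAMS_B = 17;  // parameters active for any single token

// Fraction of the model's weights that take part in one token's forward pass.
function activeFraction(activeB: number, totalB: number): number {
  return activeB / totalB;
}

const fraction = activeFraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B);
console.log(`Active per token: ${(fraction * 100).toFixed(2)}% of total parameters`);
```

With these figures, roughly 4.25% of the weights are exercised per token.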
Llama 4's native multimodality represents a different architectural approach from the adapter-based vision in Llama 3.2. Rather than adding image understanding to an existing text backbone, Llama 4 treats text and vision tokens together from the beginning in a unified backbone. This enables more coherent cross-modal reasoning.
On the LMArena leaderboard, an experimental chat version of Llama 4 Maverick 17B 128E Instruct FP8 scored an Elo of 1417. According to Meta's published benchmarks, Llama 4 Maverick 17B 128E Instruct FP8 outperforms comparable frontier models on coding, reasoning, multilingual, long-context, and image tasks, and matches other open-weight models on reasoning and coding with less than half the active parameters.
What To Consider When Choosing a Provider
- Configuration: For workloads that mix images and long text, Llama 4 Maverick 17B 128E Instruct FP8's efficiency advantage over dense models shows most at scale. Validate throughput at your expected concurrency level before you pick a provider tier, and compare the listed prices ($0.24 and $0.97).
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
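The advice above to validate throughput at your expected concurrency can be sketched as a small measurement harness. This is a minimal sketch, not a gateway feature: `Sample`, `measure`, and `sendRequest` are hypothetical names, tokens per second is simplified to total tokens over total request time, and `fakeRequest` is a stand-in with fixed numbers so the example runs without network access. In practice you would replace it with a wrapper around your real streaming call.

```typescript
// Timing data for one request (hypothetical shape for this sketch).
interface Sample { ttftMs: number; tokens: number; totalMs: number }

// Median (P50) of a list of numbers.
function p50(values: number[]): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Fire `concurrency` requests in parallel and summarize the results.
async function measure(
  sendRequest: () => Promise<Sample>, // hypothetical: wraps your real request
  concurrency: number,
): Promise<{ p50TtftMs: number; p50Tps: number }> {
  const samples = await Promise.all(
    Array.from({ length: concurrency }, () => sendRequest()),
  );
  return {
    p50TtftMs: p50(samples.map((s) => s.ttftMs)),
    // Simplification: tokens divided by total request time in seconds.
    p50Tps: p50(samples.map((s) => s.tokens / (s.totalMs / 1000))),
  };
}

// Stand-in request with fixed numbers; swap in a real call to get real data.
const fakeRequest = async (): Promise<Sample> =>
  ({ ttftMs: 300, tokens: 200, totalMs: 2000 });

measure(fakeRequest, 8).then((summary) => console.log(summary));
```

Running the same harness at several concurrency levels against each candidate provider gives a like-for-like comparison before committing to a tier.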
When to Use Llama 4 Maverick 17B 128E Instruct FP8
Best For
- Production multimodal applications: Pairing image understanding with long-form text generation for product catalog processing and document analysis with mixed visual and textual content
- Creative, coding, and multilingual workloads: Applications where the MoE architecture matches dense-model scores on published benchmarks at lower active-parameter cost
- Cost-per-quality sensitive workloads: Comparable-capability dense models are significantly more expensive to serve
- Long-context multimodal tasks: Image and text reasoning must be maintained coherently across extended conversations
- General assistant and chat: Meta designates Llama 4 Maverick 17B 128E Instruct FP8 as the intended product workhorse
Consider Alternatives When
- Extreme long documents: Llama 4 Scout's 10M token context window is purpose-built for that use case
- Text-only workload: The MoE overhead of loading all experts into memory is not offset by quality gains over a dense model at similar cost
- Maximum reasoning depth: Llama 4 Behemoth (when available) or other frontier reasoning models may be appropriate
Conclusion
Llama 4 Maverick 17B 128E Instruct FP8 combines native multimodality, a 128-expert MoE architecture, and strong benchmark results on image and text tasks at a fraction of the active-parameter cost of dense alternatives. For teams building multimodal production applications on open models, Llama 4 Maverick 17B 128E Instruct FP8 is the more capable of the two initial Llama 4 releases.
Frequently Asked Questions
What is Mixture of Experts (MoE) and how does it work in Llama 4 Maverick 17B 128E Instruct FP8?
Each input token activates only a subset of the total parameters. Llama 4 Maverick 17B 128E Instruct FP8 uses alternating dense and MoE layers. MoE layers route each token to a shared expert plus one of 128 routed experts. Only 17B of the 400B total parameters are active per token, reducing inference cost while the full parameter budget contributes to model quality.
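The routing step described above can be illustrated with a toy sketch. This is not Meta's implementation: the linear router, the scalar "experts", and all names (`MoELayer`, `moeForward`, `routerWeights`) are assumptions made for illustration, and the example uses 4 routed experts rather than 128. It shows only the core idea: every token passes through the shared expert, plus the single routed expert with the highest router score.

```typescript
type Vec = number[];
type Expert = (x: Vec) => Vec;

// Toy expert that scales its input; real experts are feed-forward networks.
const makeExpert = (scale: number): Expert => (x) => x.map((v) => v * scale);

// Dot product of two equal-length vectors.
const dot = (a: Vec, b: Vec): number => a.reduce((s, v, i) => s + v * b[i], 0);

// Index of the largest score (top-1 selection).
function argmax(scores: number[]): number {
  let best = 0;
  for (let i = 1; i < scores.length; i++) if (scores[i] > scores[best]) best = i;
  return best;
}

interface MoELayer {
  shared: Expert;       // applied to every token
  routed: Expert[];     // exactly one is selected per token
  routerWeights: Vec[]; // one scoring vector per routed expert (assumed linear router)
}

function moeForward(layer: MoELayer, token: Vec): { output: Vec; chosen: number } {
  const scores = layer.routerWeights.map((w) => dot(w, token));
  const chosen = argmax(scores);
  const sharedOut = layer.shared(token);
  const routedOut = layer.routed[chosen](token);
  // Sum both contributions. Real models often also scale the routed output by a gate value.
  return { output: sharedOut.map((v, i) => v + routedOut[i]), chosen };
}

const toyLayer: MoELayer = {
  shared: makeExpert(1),
  routed: [makeExpert(10), makeExpert(20), makeExpert(30), makeExpert(40)],
  routerWeights: [[1, 0], [0, 1], [-1, 0], [0, -1]],
};

const moeResult = moeForward(toyLayer, [0.2, 0.9]);
console.log(moeResult.chosen, moeResult.output);
```

Only the selected expert's weights are evaluated for a given token, which is why per-token compute tracks active parameters rather than total parameters.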
What does "natively multimodal" mean compared to the adapter-based vision in Llama 3.2?
Llama 3.2 added vision to an existing text backbone via cross-attention adapters, keeping language model weights frozen. Llama 4 Maverick 17B 128E Instruct FP8 processes text and vision tokens together in a unified backbone. This enables deeper cross-modal reasoning because the model was never strictly text-only.
What Elo score did Llama 4 Maverick 17B 128E Instruct FP8 achieve on LMArena?
An experimental chat version of Llama 4 Maverick 17B 128E Instruct FP8 scored an Elo of 1417 on LMArena.
What languages does Llama 4 support?
Llama 4 was pre-trained on data covering 200 languages, including over 100 with more than 1 billion tokens each, for roughly 10x more multilingual tokens overall than Llama 3.