
Llama 4 Maverick 17B 128E Instruct FP8

meta/llama-4-maverick

Llama 4 Maverick 17B 128E Instruct FP8 is Meta's natively multimodal Mixture of Experts (MoE) model, with 17B active parameters routed across 128 experts. Published benchmarks span image and text tasks, and the MoE design activates only a fraction of the parameters per token that comparably capable dense models use.

Capabilities: Tool Use, Vision (Image)
index.ts

import { streamText } from 'ai'

const result = streamText({
  model: 'meta/llama-4-maverick',
  prompt: 'Why is the sky blue?',
})

// Stream the generated text to stdout as it arrives.
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}
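
Because the model accepts image input, the same call can carry an image part alongside text. The snippet below is a minimal sketch assuming the AI SDK's multi-part message format; the file name and image URL are placeholders.

vision.ts

import { streamText } from 'ai'

const result = streamText({
  model: 'meta/llama-4-maverick',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this product photo in two sentences.' },
        // Placeholder URL; the SDK also accepts raw bytes or base64-encoded data.
        { type: 'image', image: new URL('https://example.com/product.jpg') },
      ],
    },
  ],
})

for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}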

What To Consider When Choosing a Provider

  • Zero Data Retention

    AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.

  • Authentication

    AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly; see the sketch after this list.
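
As a rough illustration of the key-based path, the sketch below assumes the @ai-sdk/gateway provider package and its createGateway helper, with the key read from an AI_GATEWAY_API_KEY environment variable; on Vercel deployments an OIDC token can stand in and no key needs to be passed. Consult the Gateway documentation for the exact options.

gateway.ts

import { createGateway } from '@ai-sdk/gateway'
import { streamText } from 'ai'

// Assumed setup: an API key issued from the Gateway dashboard.
// With a Vercel OIDC token available, createGateway() with no apiKey also works.
const gateway = createGateway({
  apiKey: process.env.AI_GATEWAY_API_KEY,
})

const result = streamText({
  model: gateway('meta/llama-4-maverick'),
  prompt: 'Summarize zero data retention in one sentence.',
})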

For workloads that mix images and long text, Llama 4 Maverick 17B 128E Instruct FP8's efficiency advantage over dense models shows most at scale. Validate throughput at your expected concurrency level before you pick a provider tier, and weigh it against the listed $0.24 and $0.97 price points.
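
One way to do that is a quick load probe before committing to a tier. The sketch below is illustrative only: it fires a batch of concurrent requests at an assumed concurrency level and reports aggregate output characters per second as a rough stand-in for token throughput.

throughput-check.ts

import { generateText } from 'ai'

// Assumed concurrency level; set this to your expected parallel request load.
const CONCURRENCY = 8

async function probe() {
  const start = Date.now()
  const results = await Promise.all(
    Array.from({ length: CONCURRENCY }, () =>
      generateText({
        model: 'meta/llama-4-maverick',
        prompt: 'Write a 200-word product description for a hiking backpack.',
      }),
    ),
  )
  const seconds = (Date.now() - start) / 1000
  const chars = results.reduce((sum, r) => sum + r.text.length, 0)
  console.log(`${CONCURRENCY} concurrent requests completed in ${seconds.toFixed(1)}s`)
  console.log(`~${Math.round(chars / seconds)} output chars/sec aggregate`)
}

probe()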

When to Use Llama 4 Maverick 17B 128E Instruct FP8

Best For

  • Production multimodal applications:

    Pairing image understanding with long-form text generation for product catalog processing and document analysis with mixed visual and textual content

  • Creative and coding workloads:

    Multilingual applications where the MoE architecture reaches dense-model scores on published benchmarks at lower active-parameter cost

  • Cost-per-quality sensitive workloads:

    Comparable-capability dense models are significantly more expensive to serve

  • Long-context multimodal tasks:

    When image and text reasoning must stay coherent across extended conversations

  • General assistant and chat:

    Meta designates Llama 4 Maverick 17B 128E Instruct FP8 as the intended product workhorse

Consider Alternatives When

  • Extreme long documents:

    Llama 4 Scout's 10M token context window is purpose-built for that use case

  • Text-only workload:

    The MoE overhead of loading all experts into memory is not offset by quality gains over a dense model at similar cost

  • Maximum reasoning depth:

    Llama 4 Behemoth (when available) or other frontier reasoning models may be appropriate

Conclusion

Llama 4 Maverick 17B 128E Instruct FP8 combines native multimodality, a 128-expert MoE architecture, and strong benchmark results on image and text tasks at a fraction of the active-parameter cost of dense alternatives. For teams building multimodal production applications on open models, Llama 4 Maverick 17B 128E Instruct FP8 is the more capable of the two initial Llama 4 releases.

FAQ

How does the Mixture of Experts architecture reduce inference cost?

Each input token activates only a subset of the total parameters. Llama 4 Maverick 17B 128E Instruct FP8 uses alternating dense and MoE layers. MoE layers route each token to a shared expert plus one of 128 routed experts. Only 17B of the 400B total parameters are active per token, reducing inference cost while the full parameter budget contributes to model quality.
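
The routing step is simple to sketch. The toy function below is illustrative only (production kernels operate on batched tensors, not per-token loops): the router scores all 128 routed experts for a token, applies the single best one, and sums its output with the always-on shared expert, so compute scales with active rather than total parameters.

moe-sketch.ts

// Toy illustration of top-1 MoE routing with a shared expert.
type Vector = number[]
type Expert = (hidden: Vector) => Vector

function moeLayer(
  hidden: Vector,
  routerLogits: number[], // one score per routed expert (128 for Maverick)
  routedExperts: Expert[],
  sharedExpert: Expert,
): Vector {
  // Top-1 routing: only the highest-scoring routed expert runs for this token.
  const best = routerLogits.indexOf(Math.max(...routerLogits))
  const routed = routedExperts[best](hidden)
  const shared = sharedExpert(hidden)
  // The token's output combines the shared expert with the one routed expert,
  // so only a fraction of the total parameters are exercised per token.
  return routed.map((value, i) => value + shared[i])
}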

How does Llama 4's multimodality differ from Llama 3.2's vision support?

Llama 3.2 added vision to an existing text backbone via cross-attention adapters, keeping the language model weights frozen. Llama 4 Maverick 17B 128E Instruct FP8 processes text and vision tokens together in a unified backbone. This enables deeper cross-modal reasoning because the model was never strictly text-only.

How does Llama 4 Maverick perform on LMArena?

An experimental chat version of Llama 4 Maverick 17B 128E Instruct FP8 scored an Elo of 1417 on LMArena.

How many languages does Llama 4 support?

Llama 4 supports 200 languages, including over 100 with more than 1 billion tokens each, representing 10x more multilingual tokens than Llama 3.