
Llama 3.1 8B

Llama 3.1 8B is a multilingual, instruction-tuned model with a context window of 131.1K tokens and tool-use capability. It suits cost-effective production deployments that need multilingual coverage and trained tool use.

Tool Use
index.ts
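import { streamText } from 'ai'
// Stream a completion from Llama 3.1 8B through AI Gateway
const result = streamText({
  model: 'meta/llama-3.1-8b',
  prompt: 'Why is the sky blue?'
})
// Print each text chunk as it arrives
for await (const chunk of result.textStream) {
  process.stdout.write(chunk)
}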

Playground

Try out Llama 3.1 8B by Meta. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

About Llama 3.1 8B

Meta released Llama 3.1 8B alongside the broader Llama 3.1 family on July 23, 2024, bringing two major upgrades over previous 8B Llama releases: an extended context window of 131.1K tokens and full multilingual capability across eight languages. Both improvements also apply to the 70B, but the 8B delivers them at substantially lower serving cost and higher throughput. This makes it the practical entry point for most teams evaluating the Llama 3.1 generation.

Tool use is a trained capability in this generation. The 8B can participate in agentic workflows that call external tools, which suits lightweight agent pipelines where the per-call cost of a larger model would be prohibitive. Combined with the 131.1K-token context window, the model can keep substantial conversation history or extensive retrieved documents in a single call.
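As a rough illustration of the application side of a tool-calling loop, the sketch below dispatches one tool call to a local handler and returns the result that would be passed back to the model. The `ToolCall` shape, the `add` and `upper` tools, and `dispatchToolCall` are hypothetical names chosen for this example. They are not part of the AI SDK or of Llama's tool-call format, and a real integration would use the SDK's own tool types.

```typescript
// Hypothetical shape of a tool call requested by the model.
// Real SDKs define their own types; this one exists only for the sketch.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => string;

// Illustrative registry; both tools are made up for this example.
const toolRegistry = new Map<string, ToolHandler>([
  ["add", (args) => String(Number(args.a) + Number(args.b))],
  ["upper", (args) => String(args.text).toUpperCase()],
]);

// Run one tool call and return the string an agent loop would append
// to the conversation as the tool result.
function dispatchToolCall(call: ToolCall): string {
  const handler = toolRegistry.get(call.name);
  if (handler === undefined) {
    // Report the problem to the model instead of throwing, so it can recover.
    return `error: unknown tool "${call.name}"`;
  }
  try {
    return handler(call.args);
  } catch (err) {
    return `error: ${String(err)}`;
  }
}

console.log(dispatchToolCall({ name: "add", args: { a: 2, b: 3 } }));
```

Returning errors as strings rather than throwing is one possible design choice: it lets the model see a failed call and try a different tool or different arguments on the next step.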

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.

| Provider | Legal | Context | Latency | Throughput | Input | Output | Cache | Release Date |
| Cerebras | Terms, Privacy | 128K | 0.1s | - | $0.10/M | $0.10/M | Read: $0.1/M | 07/23/2024 |
| Groq | Terms, Privacy | 131K | 0.1s | - | $0.05/M | $0.08/M | Read: $0.03/M | 07/23/2024 |
| Amazon Bedrock | Terms, Privacy | 128K | 0.2s | - | $0.22/M | $0.22/M | - | 07/23/2024 |
| DeepInfra | Terms, Privacy | 131K | 0.2s | 34tps | $0.03/M | $0.05/M | - | 07/23/2024 |
| Novita AI | Terms, Privacy | 16K | 0.5s | 62tps | $0.02/M | $0.05/M | - | 07/23/2024 |

Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by Meta

| Model | Context | Latency | Throughput | Input | Output | Providers | Release Date |
| | 131K | 0.2s | 39tps | $0.24/M | $0.97/M | Bedrock, DeepInfra | 04/05/2025 |
| | 131K | 0.2s | 189tps | $0.17/M | $0.66/M | Bedrock, DeepInfra, Groq | 04/05/2025 |
| | 128K | 0.1s | 180tps | $0.59/M | $0.72/M | Bedrock, Groq | 12/06/2024 |
| | 128K | 0.3s | 62tps | $0.72/M | $0.72/M | Bedrock | 09/25/2024 |
| | 128K | 0.3s | 53tps | $0.15/M | $0.15/M | Bedrock | 09/18/2024 |
| | 131K | 0.3s | 32tps | $0.72/M | $0.72/M | Bedrock, DeepInfra | 07/23/2024 |

What To Consider When Choosing a Provider

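  • Configuration: Because the 8B model runs efficiently on modest GPU hardware, providers may offer a wider range of hosting tiers. Decide whether shared or dedicated capacity fits your traffic patterns, and compare per-token pricing: in the provider table above, input prices range from $0.02/M to $0.22/M and output prices from $0.05/M to $0.22/M.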
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
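The price comparison mentioned under Configuration can be sketched as a small calculation. The example below copies the input price, output price, and context size of each provider from the table on this page, then ranks the providers that meet a minimum context requirement by estimated cost. The `ProviderListing` type and both function names are illustrative assumptions, the context values are the rounded figures as listed, and the token volumes in the usage line are made up.

```typescript
// Prices and context sizes copied from the provider table on this page.
// Context values are the rounded figures as listed (e.g. 128K -> 128000).
interface ProviderListing {
  name: string;
  contextTokens: number;
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

const listings: ProviderListing[] = [
  { name: "Cerebras", contextTokens: 128000, inputPerMillion: 0.10, outputPerMillion: 0.10 },
  { name: "Groq", contextTokens: 131000, inputPerMillion: 0.05, outputPerMillion: 0.08 },
  { name: "Amazon Bedrock", contextTokens: 128000, inputPerMillion: 0.22, outputPerMillion: 0.22 },
  { name: "DeepInfra", contextTokens: 131000, inputPerMillion: 0.03, outputPerMillion: 0.05 },
  { name: "Novita AI", contextTokens: 16000, inputPerMillion: 0.02, outputPerMillion: 0.05 },
];

// Estimated spend in USD for a given token volume on one provider.
function estimateCostUsd(listing: ProviderListing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * listing.inputPerMillion
       + (outputTokens / 1e6) * listing.outputPerMillion;
}

// Providers with enough context for the workload, cheapest first.
function rankByCost(inputTokens: number, outputTokens: number, minContextTokens: number): ProviderListing[] {
  return listings
    .filter((l) => l.contextTokens >= minContextTokens)
    .sort((a, b) =>
      estimateCostUsd(a, inputTokens, outputTokens) - estimateCostUsd(b, inputTokens, outputTokens));
}

// Hypothetical workload: 10M input tokens, 2M output tokens, prompts up to 100K tokens.
for (const l of rankByCost(10000000, 2000000, 100000)) {
  console.log(l.name, estimateCostUsd(l, 10000000, 2000000).toFixed(2));
}
```

The context filter matters here because the cheapest input price in the table belongs to a provider listed with a 16K context, which would not fit a long-context workload. Latency and throughput are left out of this ranking and may outweigh price for interactive use.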

When to Use Llama 3.1 8B

Best For

  • Cost-efficient inference: High-throughput applications where per-token economics matter for chatbots, content moderation, and classification at scale
  • Multilingual applications: Support across English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai without stepping up to 70B cost
  • Lightweight agentic pipelines: Reliable tool use without the latency and serving cost of a larger model

Consider Alternatives When

  • Deeper reasoning needed: The task demands the capability of the 70B, particularly for multi-step math or complex coding problems
  • Image understanding needed: No Llama 3.1 model supports vision input, so Llama 3.2 11B or 90B is the appropriate choice
  • Top instruction following: Instruction-following quality needs to be maximized and Llama 3.3 70B's refinements justify the larger scale

Conclusion

Llama 3.1 8B fills the gap for teams that need open-weight multilingual capability with a context window of 131.1K tokens at accessible serving costs. Tool-use support makes it a common default for production deployments where per-token efficiency drives architectural decisions.

Frequently Asked Questions

  • What benchmarks did Llama 3.1 8B perform well on?

    Llama 3.1 8B was evaluated across more than 150 benchmark datasets spanning multiple languages. The 8B performs comparably to closed and open models of a similar parameter count on general knowledge, instruction following, and tool-use tasks.

  • What tool-use behaviors are supported?

    The model supports function calling and structured output generation as trained behaviors, not just prompt-pattern following. It can operate within larger agentic systems that orchestrate external API calls or tool invocations.

  • How does the 8B handle the full context of 131.1K tokens in practice?

    Long documents, conversation histories, or retrieved content that fit within the window can be passed in full, so inputs of that size do not need to be chunked.

  • What languages are supported beyond English?

    The seven additional languages are German, French, Italian, Portuguese, Hindi, Spanish, and Thai, all with multilingual instruction following and conversational capability.
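For the long-context question above, a pre-flight check can estimate whether a set of documents fits in the window before a request is sent. The sketch below is only an approximation: the 4-characters-per-token ratio is a rough heuristic rather than the Llama tokenizer, 131072 is assumed to be the exact value behind the "131.1K" figure, and `estimateTokens` and `selectDocuments` are hypothetical helper names.

```typescript
// Assumed exact size of the "131.1K" context window (128 * 1024 tokens).
const CONTEXT_WINDOW_TOKENS = 131072;
// Rough heuristic only; the real tokenizer will give different counts.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Keep documents in order until the next one would overflow the budget.
// reservedOutputTokens leaves room for the model's reply.
function selectDocuments(
  docs: string[],
  reservedOutputTokens: number,
  contextTokens: number = CONTEXT_WINDOW_TOKENS,
): string[] {
  let remaining = contextTokens - reservedOutputTokens;
  const selected: string[] = [];
  for (const doc of docs) {
    const cost = estimateTokens(doc);
    if (cost > remaining) break;
    selected.push(doc);
    remaining -= cost;
  }
  return selected;
}

// Three 400-character documents (about 100 estimated tokens each)
// checked against a deliberately small 250-token budget.
const sampleDocs = ["a".repeat(400), "b".repeat(400), "c".repeat(400)];
console.log(selectDocuments(sampleDocs, 0, 250).length);
```

The same check can be run against a provider's listed context rather than the model maximum, since the provider table on this page lists windows as small as 16K.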