
GPT-4o

GPT-4o is OpenAI's first natively multimodal "omni" model. It unifies text, audio, image, and video processing within a single end-to-end trained architecture and delivers audio responses in 320 milliseconds on average, comparable to human conversational latency.

File Input, Tool Use, Vision (Image), Implicit Caching
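index.ts
import { streamText } from 'ai'

const result = streamText({
  model: 'openai/gpt-4o',
  prompt: 'Why is the sky blue?'
})

// Print the response as it streams in.
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}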

Playground

Try out GPT-4o by OpenAI. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.

Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
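
As a rough sketch of what a provider preference might look like in code, the helper below builds a `providerOptions` object with a `gateway.order` list. That shape follows AI Gateway's provider-routing options as we understand them, but treat it as an assumption and confirm it in the docs. The slugs `azure` and `openai` are assumed from the provider table on this page, and `buildProviderOptions` is a hypothetical helper name.

```typescript
// Hypothetical helper: builds provider-routing options for AI Gateway.
// Assumption: the gateway reads `providerOptions.gateway.order` as a
// preference-ordered list of provider slugs.
type GatewayProviderOptions = {
  gateway: { order: string[] }
}

function buildProviderOptions(preferredSlugs: string[]): GatewayProviderOptions {
  // Normalize slugs, dropping blanks and duplicates while keeping order.
  const seen = new Set<string>()
  const order: string[] = []
  for (const slug of preferredSlugs) {
    const normalized = slug.trim().toLowerCase()
    if (normalized !== '' && !seen.has(normalized)) {
      seen.add(normalized)
      order.push(normalized)
    }
  }
  return { gateway: { order } }
}

// Prefer Azure, then OpenAI (slugs assumed from the provider table).
const providerOptions = buildProviderOptions(['azure', 'openai'])

// Intended use with the AI SDK (not executed here):
// streamText({ model: 'openai/gpt-4o', prompt: '...', providerOptions })
```

Building the options in one place keeps the preference easy to change if you later copy a different slug from the table.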

| Provider | Legal | Context | Latency | Throughput | Input | Output | Cache Read | Cache Write | Web Search (Per Query) | Release Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Azure | Terms, Privacy | 128K | 0.7s | 69tps | $2.50/M | $10.00/M | $1.25/M | | $14/K + input costs | 05/13/2024 |
| OpenAI | Terms, Privacy | 128K | 0.7s | 74tps | $2.50/M | $10.00/M | $1.25/M | | $10.00/K + input costs | 05/13/2024 |
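
To make the per-token prices above concrete, here is a minimal cost estimate using the listed GPT-4o rates ($2.50/M input, $10.00/M output, $1.25/M cache read). It assumes cached prompt tokens are billed at the cache-read rate instead of the input rate, and it ignores web search charges. `estimateCostUsd` and the `Usage` shape are illustrative names, not part of any SDK.

```typescript
// Per-million-token rates for GPT-4o as listed on this page (USD).
const GPT_4O_RATES = {
  inputPerMillion: 2.5,
  outputPerMillion: 10.0,
  cachedReadPerMillion: 1.25,
}

type Usage = {
  inputTokens: number // total prompt tokens, cached ones included
  cachedInputTokens: number // portion of inputTokens served from cache
  outputTokens: number
}

// Assumption: cached prompt tokens are billed at the cache-read rate
// rather than the regular input rate. Web search fees are not included.
function estimateCostUsd(usage: Usage): number {
  const uncachedInput = usage.inputTokens - usage.cachedInputTokens
  const totalPerMillion =
    uncachedInput * GPT_4O_RATES.inputPerMillion +
    usage.cachedInputTokens * GPT_4O_RATES.cachedReadPerMillion +
    usage.outputTokens * GPT_4O_RATES.outputPerMillion
  return totalPerMillion / 1_000_000
}

// 100K prompt tokens (40K of them cached) and 5K output tokens.
// 60,000 * 2.50 + 40,000 * 1.25 + 5,000 * 10.00 = 250,000 -> $0.25
const exampleCost = estimateCostUsd({
  inputTokens: 100_000,
  cachedInputTokens: 40_000,
  outputTokens: 5_000,
})
```
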
Throughput

P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.

Latency

P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.
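
For readers unfamiliar with the term, P50 is the median: half of the observed requests were at or below that value. The sketch below computes it with the nearest-rank method on made-up TTFT samples. How AI Gateway actually aggregates its metrics is not described on this page, so the method and the sample values are assumptions for illustration only.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples')
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length) // 1-based rank
  return sorted[rank - 1]
}

// Hypothetical time-to-first-token samples in milliseconds.
const ttftSamplesMs = [650, 720, 700, 1900, 680]

// Sorted: 650, 680, 700, 720, 1900. Rank ceil(0.5 * 5) = 3 -> 700.
const p50TtftMs = percentile(ttftSamplesMs, 50)
```

Note how the single 1900 ms outlier does not move the P50, which is why a median is a common choice for latency dashboards.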

Uptime

Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.

More models by OpenAI

| Model | Context | Latency | Throughput | Input | Output | Cache Read | Cache Write | Web Search (Per Query) | Providers | Release Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 1M | 3.2s | 68tps | $5.00/M | $30.00/M | $0.5/M | | $10.00/K + input costs | Azure, OpenAI | 04/24/2026 |
| | 400K | 1.5s | 190tps | $0.75/M | $4.50/M | $0.07/M | | $10.00/K + input costs | Azure, OpenAI | 03/17/2026 |
| | 400K | 0.5s | 117tps | $0.20/M | $1.25/M | $0.02/M | | $10.00/K + input costs | Azure, OpenAI | 03/17/2026 |
| | 128K | 0.5s | 111tps | $1.25/M | $10.00/M | $0.13/M | | $10.00/K + input costs | Azure, OpenAI | 11/12/2025 |
| | 400K | 3.4s | 446tps | $0.25/M | $2.00/M | $0.03/M | | $14/K + input costs | Azure, OpenAI | 08/07/2025 |
| | 131K | 0.1s | 223tps | $0.35/M | $0.75/M | $0.25/M | | | Baseten, Bedrock, Cerebras, +5 | 08/05/2025 |

About GPT-4o

GPT-4o was announced on May 13, 2024, at OpenAI's Spring Update event. The "o" stands for "omni," reflecting the model's foundational design: rather than connecting separate specialist models for different modalities, GPT-4o was trained end-to-end across text, audio, image, and video. Prior voice approaches chained a speech recognition model, a language model, and a text-to-speech model, adding latency at each boundary; the earlier GPT-4-based Voice Mode averaged 5.4 seconds per turn. Handling audio natively in one model is what enables GPT-4o's sub-400-millisecond audio responses.

GPT-4o matched GPT-4 Turbo on English text and code and improved notably on non-English text, while launching at half the API price. For developers already using GPT-4 Turbo, that made it a natural default: more multimodal capability at a lower price.

The model accepts any combination of text, audio, image, and video as input and can generate text, audio, and image outputs. That range covers real-time voice assistants, vision pipelines that analyze photographs or documents, and agents that process video frames alongside textual context.
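
As one illustration of mixing modalities, the sketch below builds a single user message that pairs a text question with images. The part shapes (`{ type: 'text' }` and `{ type: 'image' }`) mirror the AI SDK's message content parts as we understand them; the local types, the `buildImageQuestion` helper, and the example URL are placeholders, so check the SDK reference before relying on them.

```typescript
// Local stand-ins for the AI SDK's user content parts (assumed shapes).
type TextPart = { type: 'text'; text: string }
type ImagePart = { type: 'image'; image: URL }
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> }

// Hypothetical helper: one text question followed by any number of images.
function buildImageQuestion(question: string, imageUrls: string[]): UserMessage {
  const imageParts = imageUrls.map(
    (url): ImagePart => ({ type: 'image', image: new URL(url) })
  )
  return {
    role: 'user',
    content: [{ type: 'text', text: question }, ...imageParts],
  }
}

// Placeholder URL for illustration only.
const imageQuestion = buildImageQuestion('What trend does this chart show?', [
  'https://example.com/chart.png',
])

// Intended use with the AI SDK (not executed here):
// streamText({ model: 'openai/gpt-4o', messages: [imageQuestion] })
```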

What To Consider When Choosing a Provider

  • Configuration: For applications that mix modalities (for example, a voice interface that also accepts image uploads), a single model endpoint simplifies both architecture and cost accounting compared to separate specialist models.
  • Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
  • Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
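
The authentication point above can be sketched as a small credential lookup. The environment variable names `AI_GATEWAY_API_KEY` and `VERCEL_OIDC_TOKEN` are assumptions based on common Vercel conventions, and preferring the API key over the OIDC token is our choice for this example rather than documented gateway behavior. `resolveGatewayCredential` is a hypothetical helper.

```typescript
type GatewayCredential =
  | { kind: 'api-key'; token: string }
  | { kind: 'oidc'; token: string }

// Hypothetical helper: picks whichever gateway credential is present.
// Assumption: variable names and precedence are illustrative only.
function resolveGatewayCredential(
  env: Record<string, string | undefined>
): GatewayCredential {
  const apiKey = env['AI_GATEWAY_API_KEY']
  if (apiKey) return { kind: 'api-key', token: apiKey }

  const oidcToken = env['VERCEL_OIDC_TOKEN']
  if (oidcToken) return { kind: 'oidc', token: oidcToken }

  throw new Error('No AI Gateway credential found in environment')
}

// Example with an explicit map; in Node.js you would pass process.env.
const credential = resolveGatewayCredential({ AI_GATEWAY_API_KEY: 'example-key' })
```

Either way, only a gateway credential is resolved here; no provider-specific keys are involved.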

When to Use GPT-4o

Best For

  • Real-time voice applications: Native audio processing eliminates pipeline latency between speech recognition, reasoning, and synthesis
  • Mixed-modality workflows: Processing both images and text together for visual question answering or document analysis with figures
  • Cost-efficient GPT-4 quality: Applications that need GPT-4-class text quality with lower API cost than GPT-4 Turbo
  • Multilingual applications: Products that benefit from GPT-4o's improved non-English text performance
  • Evolving multimodal products: Apps that may expand from text-only to multimodal inputs over time without switching models

Consider Alternatives When

  • Maximum coding capability: GPT-4.1 benchmarks meaningfully better on some coding tasks
  • Cost-driven workloads: GPT-4.1 mini provides sufficient quality at lower cost
  • Deep multi-step reasoning: The o-series reasoning models' chain-of-thought approach is better suited to complex analytical problems

Conclusion

GPT-4o introduced native omni-modal processing as a practical API capability, eliminating the latency penalties of chained pipeline architectures and matching GPT-4 Turbo quality at lower cost. For applications that need a single model to handle text, audio, and vision inputs reliably, it remains a strong foundation through AI Gateway.

Frequently Asked Questions

  • What does "omni" mean in GPT-4o's name?

    It reflects the model's end-to-end native training across text, audio, image, and video modalities, rather than being a combination of separate specialist models connected by a pipeline.

  • How much faster is GPT-4o's audio response compared to earlier voice pipelines?

    GPT-4o averages 320 milliseconds for audio responses, while the prior GPT-4-based Voice Mode averaged 5.4 seconds. That makes GPT-4o roughly 17x faster for voice (5.4 s / 0.32 s ≈ 16.9).

  • How does GPT-4o's API pricing compare to GPT-4 Turbo?

    GPT-4o launched at lower API cost than GPT-4 Turbo while matching its performance on English text and code, and improving on non-English languages.

  • What input and output modalities does GPT-4o support?

    Inputs: text, audio, image, video. Outputs: text, audio, and image. This breadth makes it flexible for diverse multimodal application architectures.

  • Is the "gpt-4o" model alias the same as a specific dated snapshot?

    No. The alias gpt-4o points to the latest stable version, which may be updated over time. Dated snapshots like gpt-4o-2024-05-13 or gpt-4o-2024-11-20 pin to specific releases.

  • Does routing GPT-4o through AI Gateway add latency?

    AI Gateway is designed as a lightweight routing layer. For most applications, the observability, caching, and authentication benefits outweigh any marginal overhead.

  • What are typical latency characteristics?

    This page shows live throughput and time-to-first-token metrics measured across real AI Gateway traffic.
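
To illustrate the alias-versus-snapshot question above, the sketch below checks whether a model ID ends in a date suffix such as `-2024-11-20`. The helper names are hypothetical, and the date-suffix pattern is inferred from the two snapshot IDs mentioned in the FAQ. Whether AI Gateway accepts dated snapshot IDs is not stated on this page, so confirm that before pinning.

```typescript
// Matches IDs ending in a date suffix such as "-2024-11-20".
const SNAPSHOT_SUFFIX = /-\d{4}-\d{2}-\d{2}$/

// True for dated snapshots, false for floating aliases like "gpt-4o".
function isPinnedSnapshot(modelId: string): boolean {
  return SNAPSHOT_SUFFIX.test(modelId)
}

// Strips the date suffix, returning the floating alias.
function baseAlias(modelId: string): string {
  return modelId.replace(SNAPSHOT_SUFFIX, '')
}

const aliasIsPinned = isPinnedSnapshot('gpt-4o') // false
const snapshotIsPinned = isPinnedSnapshot('gpt-4o-2024-05-13') // true
const aliasOfSnapshot = baseAlias('gpt-4o-2024-11-20') // "gpt-4o"
```

A check like this can be useful in tests or config validation when reproducibility matters more than picking up the latest model update.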