GPT-4o
GPT-4o is OpenAI's first natively multimodal "omni" model, unifying text, audio, image, and video processing within a single end-to-end trained architecture and delivering audio response times averaging 320 milliseconds, comparable to human conversational latency.
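import { streamText } from 'ai'
const result = streamText({ model: 'openai/gpt-4o', prompt: 'Why is the sky blue?' })
// Print the response as it streams in
for await (const chunk of result.textStream) process.stdout.write(chunk)
Playground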
Try out GPT-4o by OpenAI. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.
Providers
Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
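As a rough sketch of how a provider preference might be expressed in code, the helper below builds a `providerOptions` object with a `gateway.order` list. The option shape reflects our understanding of AI Gateway's provider options and should be checked against the docs; `preferProviders` is a hypothetical helper, and the slugs `azure` and `openai` are placeholders for whichever slugs you copy from this page.

```typescript
// Hypothetical helper: lists provider slugs in priority order.
// The `gateway.order` shape is an assumption; verify it in the AI Gateway docs.
type GatewayProviderOptions = { gateway: { order: string[] } };

function preferProviders(...slugs: string[]): GatewayProviderOptions {
  // Drop blanks and duplicates while preserving the caller's order.
  const order: string[] = [];
  for (const raw of slugs) {
    const slug = raw.trim();
    if (slug.length > 0 && order.indexOf(slug) === -1) {
      order.push(slug);
    }
  }
  return { gateway: { order } };
}

// Placeholder slugs; substitute the ones shown for this model.
const providerOptions = preferProviders('azure', 'openai', 'azure');
```

The resulting object could then be passed as `providerOptions` alongside `model` and `prompt` in a `streamText` call.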
P50 throughput on live AI Gateway traffic, in tokens per second (TPS). Visit the docs for more info.
P50 time to first token (TTFT) on live AI Gateway traffic, in milliseconds. View the docs for more info.
Direct request success rate on AI Gateway and per-provider. Visit the docs for more info.
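P50 is the median: half of the observed requests were faster and half were slower. A minimal sketch of that calculation follows. The sample values are invented for illustration and are not AI Gateway measurements, and the gateway's exact aggregation method is not described on this page.

```typescript
// Median (P50) of a list of samples. For an even count this averages the two
// middle values, which is one common convention among several.
function p50(samples: number[]): number {
  if (samples.length === 0) {
    throw new Error('p50 requires at least one sample');
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Throughput in tokens per second for a single response.
function tokensPerSecond(outputTokens: number, generationMs: number): number {
  return outputTokens / (generationMs / 1000);
}

// Illustrative numbers only.
const ttftMs = [410, 380, 520, 450, 395];
const medianTtftMs = p50(ttftMs); // 410
const tps = tokensPerSecond(300, 2500); // 120
```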
About GPT-4o
GPT-4o was announced on May 13, 2024 at OpenAI's Spring Updates event. The "o" stands for "omni," reflecting the model's foundational design: rather than connecting separate specialist models for different modalities, GPT-4o was trained end-to-end across text, audio, image, and video. This architectural choice enables sub-400-millisecond audio responses. Prior approaches chained a speech recognition model, a language model, and a text-to-speech model together, introducing latency at each boundary. GPT-4 Turbo-based voice averaged 5.4 seconds per turn.
GPT-4o matched GPT-4 Turbo on text and code in English while costing less in the API, with notable improvements on non-English text. This made it the default for developers who previously used GPT-4 Turbo: an upgrade in multimodal capability at a lower price.
The model accepts any combination of text, audio, image, and video as input and can generate text, audio, and image outputs. This flexibility spans real-time voice assistants, vision pipelines that analyze photographs or documents, and agents that process video frames alongside textual context.
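To make the mixed-input idea concrete, here is a small sketch that assembles a user message combining text with image references. The part shapes (`type: 'text'`, `type: 'image'`) are declared locally and are assumed to resemble what the AI SDK accepts; `buildVisionMessage` is a hypothetical helper and the URL is a placeholder.

```typescript
// Locally declared shapes, assumed to resemble AI SDK message parts.
type TextPart = { type: 'text'; text: string };
type ImagePart = { type: 'image'; image: string };
type UserMessage = { role: 'user'; content: Array<TextPart | ImagePart> };

// Builds one user message: the question first, then each image reference.
function buildVisionMessage(question: string, imageUrls: string[]): UserMessage {
  const content: Array<TextPart | ImagePart> = [{ type: 'text', text: question }];
  for (const url of imageUrls) {
    content.push({ type: 'image', image: url });
  }
  return { role: 'user', content };
}

// Placeholder URL for illustration.
const visionMessage = buildVisionMessage('What does this chart show?', [
  'https://example.com/chart.png',
]);
```

A message like this would typically go in a `messages` array in place of the plain `prompt` string.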
What To Consider When Choosing a Provider
- Configuration: For applications that mix modalities (for example, a voice interface that also accepts image uploads), a single model endpoint simplifies both architecture and cost accounting compared to separate specialist models.
- Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
- Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.
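One possible way to choose between the two credential types is sketched below. The environment variable names `AI_GATEWAY_API_KEY` and `VERCEL_OIDC_TOKEN`, the bearer-token header, and the `gatewayAuthHeader` helper are assumptions for illustration; the documentation is the authority on what the gateway actually reads.

```typescript
type Env = Record<string, string | undefined>;

// Returns an Authorization header, preferring an API key and falling back to
// an OIDC token. Variable names are assumptions, not confirmed by this page.
function gatewayAuthHeader(env: Env): { Authorization: string } {
  const token = env['AI_GATEWAY_API_KEY'] || env['VERCEL_OIDC_TOKEN'];
  if (!token) {
    throw new Error('No AI Gateway credential found in the environment');
  }
  return { Authorization: 'Bearer ' + token };
}

// Placeholder token for illustration.
const authHeader = gatewayAuthHeader({ VERCEL_OIDC_TOKEN: 'oidc-placeholder' });
```

Passing the environment in as a parameter keeps the function easy to test and avoids reading global state directly.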
When to Use GPT-4o
Best For
- Real-time voice applications: Native audio processing eliminates pipeline latency between speech recognition, reasoning, and synthesis
- Mixed-modality workflows: Processing both images and text together for visual question answering or document analysis with figures
- Cost-efficient GPT-4 quality: Applications that need GPT-4-class text quality with lower API cost than GPT-4 Turbo
- Multilingual applications: Products that benefit from GPT-4o's improved non-English text performance
- Evolving multimodal products: Apps that may start text-only and later add multimodal inputs, without having to switch models
Consider Alternatives When
- Maximum coding capability: GPT-4.1 scores meaningfully higher on benchmarks for some coding use cases
- Cost-driven workloads: GPT-4.1 mini provides sufficient quality at lower cost
- Deep multi-step reasoning: The o-series reasoning models' chain-of-thought approach is superior for complex analytical problems
Conclusion
GPT-4o introduced native omni-modal processing as a practical API capability, eliminating the latency penalties of chained pipeline architectures and matching GPT-4 Turbo quality at lower cost. For applications that need a single model to handle text, audio, and vision inputs reliably, it remains a strong foundation through AI Gateway.
Frequently Asked Questions
What does "omni" mean in GPT-4o's name?
It reflects the model's end-to-end native training across text, audio, image, and video modalities, rather than being a combination of separate specialist models connected by a pipeline.
How much faster is GPT-4o's audio response compared to earlier voice pipelines?
GPT-4o averages 320 milliseconds for audio responses; the prior GPT-4 Turbo-based voice approach averaged 5.4 seconds, making GPT-4o roughly 17x faster for voice.
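The ratio can be checked directly from the two latency figures quoted above. A short sketch of the arithmetic, with `speedup` as a hypothetical helper name:

```typescript
// Ratio of the older voice latency to GPT-4o's average, both in milliseconds.
function speedup(previousMs: number, currentMs: number): number {
  return previousMs / currentMs;
}

const ratio = speedup(5400, 320); // 16.875
```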
How does GPT-4o's API pricing compare to GPT-4 Turbo?
GPT-4o launched at lower API cost than GPT-4 Turbo while matching its performance on English text and code, and improving on non-English languages.
What input and output modalities does GPT-4o support?
Inputs: text, audio, image, video. Outputs: text, audio, and image. This breadth makes it flexible for diverse multimodal application architectures.
Is the "gpt-4o" model alias the same as a specific dated snapshot?
No. The alias gpt-4o points to the latest stable version, which may be updated over time. Dated snapshots like gpt-4o-2024-05-13 or gpt-4o-2024-11-20 pin to specific releases.
Does routing GPT-4o through AI Gateway add latency?
AI Gateway is designed as a lightweight routing layer. For most applications, the observability, caching, and authentication benefits outweigh any marginal overhead.
What are typical latency characteristics?
This page shows live throughput and time-to-first-token metrics measured across real AI Gateway traffic.