GPT-4o
GPT-4o is OpenAI's first natively multimodal "omni" model, unifying text, audio, image, and video processing within a single end-to-end trained architecture and delivering audio response times averaging 320 milliseconds, comparable to human conversational latency.
```ts
import { streamText } from 'ai'

const result = streamText({ model: 'openai/gpt-4o', prompt: 'Why is the sky blue?' })

// Print tokens as they arrive
for await (const text of result.textStream) process.stdout.write(text)
```

Frequently Asked Questions
What does "omni" mean in GPT-4o's name?
It reflects the model's end-to-end native training across text, audio, image, and video modalities, rather than being a combination of separate specialist models connected by a pipeline.
How much faster is GPT-4o's audio response compared to earlier voice pipelines?
GPT-4o averages 320 milliseconds for audio responses, while the prior GPT-4 Turbo-based voice pipeline averaged 5.4 seconds. Since 5.4 s / 0.32 s ≈ 17, GPT-4o is roughly 17x faster for voice.
How does GPT-4o's API pricing compare to GPT-4 Turbo?
GPT-4o launched at half the API price of GPT-4 Turbo while matching its performance on English text and code, and significantly improving on non-English languages.
What input and output modalities does GPT-4o support?
Inputs: text, audio, image, video. Outputs: text, audio, and image. This breadth makes it flexible for diverse multimodal application architectures.
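For image inputs through the AI SDK, you can pass multimodal message content parts instead of a plain prompt. A minimal sketch, assuming the SDK's image content-part format; the image URL is a placeholder:

```ts
import { generateText } from 'ai'

// Send a text instruction plus an image to GPT-4o (placeholder URL)
const { text } = await generateText({
  model: 'openai/gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image in one sentence.' },
        { type: 'image', image: new URL('https://example.com/photo.jpg') },
      ],
    },
  ],
})

console.log(text)
```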
Is the "gpt-4o" model alias the same as a specific dated snapshot?
No. The alias gpt-4o points to the latest stable version, which may be updated over time. Dated snapshots like gpt-4o-2024-05-13 or gpt-4o-2024-11-20 pin to specific releases.
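To pin a release, pass the dated snapshot ID in place of the alias. A minimal sketch mirroring the streamText example above:

```ts
import { streamText } from 'ai'

// 'openai/gpt-4o' tracks the latest stable release; a dated ID pins one snapshot
const result = streamText({
  model: 'openai/gpt-4o-2024-11-20',
  prompt: 'Why is the sky blue?',
})
```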
Does routing GPT-4o through AI Gateway add latency?
AI Gateway is designed as a lightweight routing layer. For most applications, the observability, caching, and authentication benefits outweigh any marginal overhead.
What are typical latency characteristics?
This page shows live throughput and time-to-first-token metrics measured across real AI Gateway traffic.