GPT-4o was announced on May 13, 2024 at OpenAI's Spring Update event. The "o" stands for "omni," reflecting the model's foundational design: rather than connecting separate specialist models for different modalities, GPT-4o was trained end-to-end across text, audio, image, and video. This architectural choice enables audio responses averaging roughly 320 milliseconds, comparable to human conversational response times. Prior approaches chained a speech recognition model, a language model, and a text-to-speech model together, introducing latency at each boundary; the GPT-4-based Voice Mode averaged 5.4 seconds per turn.
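To make the latency argument concrete, here is a minimal sketch of such a chained voice pipeline using the OpenAI Python SDK. Each numbered step is a separate network round trip; the specific model names, voice, and file path are illustrative assumptions, not the exact components OpenAI's original Voice Mode used.

```python
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str) -> bytes:
    """One conversational turn through a chained STT -> LLM -> TTS pipeline."""
    # 1. Speech recognition: transcribe the user's audio (first round trip).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Language model: generate a text reply from the transcript (second round trip).
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # 3. Text-to-speech: synthesize audio from the reply (third round trip).
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content
```

Because every stage waits on the previous one, the per-turn latency is the sum of three model calls plus their network overhead, which is the gap an end-to-end multimodal model closes.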
GPT-4o matched GPT-4 Turbo on text and code in English while costing half as much in the API, with notable improvements on non-English text. This made it the natural default for developers who had previously used GPT-4 Turbo: an upgrade in multimodal capability at a lower price.
The model accepts any combination of text, audio, image, and video as input and can generate text, audio, and image outputs. This flexibility spans real-time voice assistants, vision pipelines that analyze photographs or documents, and agents that process video frames alongside textual context.
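As a concrete example, a single Chat Completions request can mix text and image content parts. The sketch below assumes the OpenAI Python SDK and a placeholder image URL, and shows only text-plus-image input with text output; audio is handled through the audio-focused endpoints.

```python
from openai import OpenAI

client = OpenAI()

# Ask GPT-4o a question about an image by combining text and image_url
# content parts in one user message. The URL is a placeholder.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the chart in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```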