GPT-4o was announced on May 13, 2024 at OpenAI's Spring Update event. The "o" stands for "omni," reflecting the model's foundational design: rather than connecting separate specialist models for different modalities, GPT-4o was trained end-to-end across text, audio, image, and video. This architectural choice enables audio responses averaging roughly 320 milliseconds, comparable to human conversational response times. Prior approaches chained a speech recognition model, a language model, and a text-to-speech model together, introducing latency at each boundary; the GPT-4-based Voice Mode averaged 5.4 seconds per turn.
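To make the latency argument concrete, here is a minimal sketch of such a chained voice pipeline using the OpenAI Python SDK. Each numbered step is a separate network round trip; the specific model names, voice, and file path are illustrative assumptions, not the exact components OpenAI's original Voice Mode used.

```python
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str) -> bytes:
    """One conversational turn through a chained STT -> LLM -> TTS pipeline."""
    # 1. Speech recognition: transcribe the user's audio (first round trip).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Language model: generate a text reply from the transcript (second round trip).
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # 3. Text-to-speech: synthesize audio from the reply (third round trip).
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content
```

Because every stage waits on the previous one, the per-turn latency is the sum of three model calls plus their network overhead, which is the gap an end-to-end multimodal model closes.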
GPT-4o matched GPT-4 Turbo on text and code in English while costing half as much in the API, with notable improvements on non-English text. This made it the natural default for developers who had previously used GPT-4 Turbo: an upgrade in multimodal capability at a lower price.
The model accepts any combination of text, audio, image, and video as input and can generate text, audio, and image outputs. This flexibility spans real-time voice assistants, vision pipelines that analyze photographs or documents, and agents that process video frames alongside textual context.
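As a concrete example, a single Chat Completions request can mix text and image content parts. The sketch below assumes the OpenAI Python SDK and a placeholder image URL, and shows only text-plus-image input with text output; audio is handled through the audio-focused endpoints.

```python
from openai import OpenAI

client = OpenAI()

# Ask GPT-4o a question about an image by combining text and image_url
# content parts in one user message. The URL is a placeholder.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the chart in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```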