OpenAI

GPT-4o Realtime (Dec '24)

Released Dec 2024

Context: 128K

GPT-4o Realtime (Dec '24) is a version of OpenAI's multimodal GPT-4o model optimized for low-latency speech-to-speech interaction via the Realtime API. Released as the gpt-4o-realtime-preview-2024-12-17 snapshot, this iteration minimizes delay in multimodal conversations by processing audio streams directly rather than chaining separate speech-to-text and text-to-speech models in a cascaded pipeline. This architecture allows the model to capture and generate non-verbal cues such as tone, emotion, and inflection.
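The direct audio path described above is exposed over a WebSocket connection rather than a request/response endpoint. The sketch below shows, under stated assumptions, how a client might prepare that connection and its opening `session.update` event; the URL, the `OpenAI-Beta` header, and the exact session fields reflect the Realtime API's beta protocol and should be checked against current documentation before use.

```python
import json

# Realtime API WebSocket endpoint for this snapshot (assumed URL shape).
REALTIME_URL = (
    "wss://api.openai.com/v1/realtime"
    "?model=gpt-4o-realtime-preview-2024-12-17"
)

def build_headers(api_key: str) -> dict:
    """Headers needed to open the Realtime WebSocket (beta header assumed)."""
    return {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",
    }

def build_session_update(voice: str = "alloy") -> str:
    """A session.update event enabling audio in/out with a chosen voice.

    Field names follow the Realtime API's session object; server-side
    voice activity detection (server_vad) is what lets the model handle
    interruptions naturally.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {"type": "server_vad"},
        },
    })
```

In practice a client would open the socket (for example with a WebSocket library, passing `build_headers(...)`), send the `session.update` event as its first message, then stream base64-encoded audio chunks and read server events as they arrive.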

The December 2024 update introduced significant enhancements over the initial preview release, including support for prompt caching, which reduces cost and latency for repeated context. It also expanded the voice lineup to eight distinct options: Alloy, Ash, Ballad, Coral, Echo, Sage, Shimmer, and Verse. These improvements aimed to make interactive voice applications more responsive and human-like.

Technically, the model supports a 128,000-token context window, though audio input and output tokens are metered differently from text tokens. It is designed to handle interruptions naturally and can perform complex tasks such as simultaneous translation, customer support, and interactive storytelling. The model also retains standard developer features such as function calling and structured outputs within the real-time audio stream.
