Advancing voice intelligence with new models in the API

We’re introducing three audio models in the API that unlock a new class of voice apps for developers. With these models, developers can build voice experiences that feel more natural, respond more intelligently, and take action in real time:
- GPT‑Realtime‑2, our first voice model with GPT‑5‑class reasoning that can handle harder requests and carry the conversation forward naturally.
- GPT‑Realtime‑Translate, a new live translation model that translates speech from 70+ input languages into 13 output languages while keeping pace with the speaker.
- GPT‑Realtime‑Whisper, a new streaming speech-to-text model that transcribes speech live as the speaker talks.
Voice is becoming one of the most natural ways for people to use software. It lets someone ask for help while driving, change a travel plan while walking through an airport, get support in their preferred language, or move through a task without stopping to type.
But building useful voice products takes more than fast turn-taking or a natural-sounding voice. A voice agent needs to understand what someone means, keep track of context, recover when a request changes, use tools while the conversation continues, and respond in a way that feels appropriate to the moment.
Together, the models we are launching move realtime audio from simple call-and-response toward voice interfaces that can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds.
As voice becomes a more natural way to use software, we’re seeing developers build around three emerging patterns in voice AI:
- Voice-to-action, where people can describe what they need and the system can reason through the request, use tools, and complete the task. For example, Zillow is building an assistant that can listen, reason, and act on requests like: “find me homes within my BuyAbility, avoid busy streets, and schedule a tour for Saturday.”
- Systems-to-voice, where software can turn context into live spoken guidance. For example, a travel app could proactively tell a traveler: “Your inbound flight is delayed, but you can still make your connection. I found the new gate, mapped the fastest route through the terminal, and your bag is still expected to transfer.”
- Voice-to-voice, where AI can help live conversations continue across languages, tasks, or changing context. For example, Deutsche Telekom is building voice support experiences where customers can speak in the language they’re most comfortable using, while the model translates the conversation in real time.
These patterns can also work together. Priceline is working toward a future where travelers can manage entire trips by voice: searching for flights and hotels conversationally, handling changes like adjusting a hotel reservation after a flight delay or getting real-time updates on TSA wait times, and translating conversations once travelers are on the ground.
GPT‑Realtime‑2 is built for live voice interactions where the model keeps the conversation moving while it reasons through a request, calls tools, handles corrections or interruptions, and responds in a way that fits the moment.
- Preambles: Developers can enable short phrases before a main response, like “let me check that” or “one moment while I look into it,” so users know the agent is working on the request.
- Parallel tool calls and tool transparency: The model can call multiple tools at once and make those actions audible with phrases like “checking your calendar” or “looking that up now,” helping agents stay responsive while completing tasks.
- Stronger recovery behavior: The model can recover more gracefully by saying things like “I’m having trouble with that right now,” instead of failing silently or breaking the conversation.
- Longer context for agentic workflows: We’re increasing the context window from 32K to 128K to support longer, more coherent sessions and more complex task flows.
- Stronger domain understanding: The model better retains specialized terminology, proper nouns, healthcare terms, and other vocabulary that matters in production settings.
- More controllable tone and delivery: The model can better adjust its tone—speaking calmly while resolving an issue, empathetically when a user is frustrated, or upbeat when confirming a successful action.
- Adjustable reasoning effort: Developers can now select from minimal, low, medium, high, and xhigh reasoning levels, with low as the default, balancing lower latency for straightforward interactions with more deliberate reasoning for complex requests.
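As a rough illustration of how an application might work with two of the features above, the sketch below validates a reasoning effort level and runs two tool calls concurrently after a spoken preamble. It is a minimal sketch under stated assumptions: the payload field names (`model`, `reasoning`, `effort`), the model identifier string, and the two stand-in tools are hypothetical and are not confirmed Realtime API parameters. Only the five effort levels and the `low` default come from the text.

```python
import asyncio
from typing import Any, Callable, Dict, List

# Effort levels named in the announcement; "low" is the stated default.
REASONING_EFFORTS = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_EFFORT = "low"


def build_session_config(effort: str = DEFAULT_EFFORT) -> Dict[str, Any]:
    """Build a hypothetical session payload with a validated effort level.

    The field layout here is an assumption for illustration only.
    """
    if effort not in REASONING_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "gpt-realtime-2",        # assumed identifier
        "reasoning": {"effort": effort},  # assumed field names
    }


async def check_calendar(day: str) -> str:
    """Stand-in tool; a real one would query a calendar service."""
    await asyncio.sleep(0.01)
    return f"{day}: free after 2pm"


async def look_up_listing(listing_id: str) -> str:
    """Stand-in tool; a real one would query a listings service."""
    await asyncio.sleep(0.01)
    return f"listing {listing_id}: tours available"


async def run_turn(speak: Callable[[str], None]) -> List[str]:
    """Speak a short preamble, then run both tools concurrently."""
    speak("Let me check that.")
    # gather() runs the coroutines concurrently and returns results
    # in the order the coroutines were passed in.
    results = await asyncio.gather(
        check_calendar("Saturday"),
        look_up_listing("A-17"),
    )
    return list(results)
```

Calling `asyncio.run(run_turn(print))` prints the preamble first and then returns both tool results. In a real integration the model decides which tools to call; the application side only needs to execute the requested calls without blocking the conversation.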
The gains show up on audio evals that map closely to production voice agents. GPT‑Realtime‑2 (high) scores 15.2% higher than GPT‑Realtime‑1.5 on Big Bench Audio, which measures audio intelligence. GPT‑Realtime‑2 (xhigh) scores 13.8% higher than GPT‑Realtime‑1.5 on Audio MultiChallenge, which measures instruction following, showing stronger reasoning, context management, and control in live conversations.
The magic of GPT‑Realtime‑2 shows up across a variety of use cases. During early testing, businesses used it to build voice agents that help customers and employees get things done through natural conversation:
“What stood out about GPT-Realtime-2 was the intelligence and tool-calling reliability it brings to complex voice interactions. On our hardest adversarial benchmark, this translates to a 26-point lift in call success rate after prompt optimization (95% vs. 69%). GPT-Realtime-2 is also materially more robust on Fair Housing compliance, which is critical for our business. The combination of agentic competence and guardrail strength is what makes it viable for production voice at Zillow.”
GPT‑Realtime‑Translate helps developers build live multilingual voice experiences where each person can speak in their preferred language, hear the conversation translated in real time, and read a live transcription. It supports more than 70 input languages and 13 output languages, making it useful for customer support, cross-border sales, education, events, media, and creator platforms serving global audiences.
For developers, live translation needs to preserve meaning while keeping pace with the speaker, even when people speak naturally, switch context, or use regional pronunciation and domain-specific language. For example, Deutsche Telekom is testing the model for multilingual voice interactions, where lower latency and stronger fluency can make cross-language conversations feel more natural.
In a demo video, Vimeo shows how GPT‑Realtime‑Translate can translate a product education video live as it plays, so global customers can hear updates in their preferred language without waiting for a separately produced version.
“Building voice AI for India means handling diverse regional phonetics. In our evals across Hindi, Tamil, and Telugu, GPT-Realtime-Translate delivered 12.5% lower Word Error Rates than any other model we tested, along with lower fallback rates, higher task completion, and latency that sustained natural conversation. It sets a new standard for multilingual voice AI.”
GPT‑Realtime‑Whisper is a new streaming transcription model built for low-latency speech-to-text. It transcribes audio as people speak, so live products can feel faster, more responsive, and more natural—from captions that appear in the moment, to meeting notes that keep up with the conversation.
The model makes live speech usable inside business workflows as it happens. Teams can power captions for meetings, classrooms, broadcasts, and events; generate notes and summaries while conversations are still in progress; build voice agents that need to understand users continuously; and create faster follow-up workflows for customer support, healthcare, sales, recruiting, and other high-volume spoken interactions.
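To make the live-caption idea concrete, here is a minimal sketch of how a client might fold streaming partial transcripts into caption lines. The event shapes (`"delta"` and `"completed"` types with a `"text"` field) are hypothetical placeholders chosen for illustration; they are not taken from the Realtime API and would need to be mapped to whatever events the service actually emits.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CaptionBuffer:
    """Accumulates partial transcript text into finished caption lines."""

    current: str = ""
    lines: List[str] = field(default_factory=list)

    def handle(self, event: Dict[str, str]) -> str:
        """Apply one event and return the in-progress caption text."""
        if event["type"] == "delta":        # hypothetical event type
            # Append the newly transcribed fragment to the open caption.
            self.current += event["text"]
        elif event["type"] == "completed":  # hypothetical event type
            # Close the caption and start a fresh one.
            self.lines.append(self.current.strip())
            self.current = ""
        return self.current


# Example usage with hand-written events.
buffer = CaptionBuffer()
for ev in (
    {"type": "delta", "text": "Hello "},
    {"type": "delta", "text": "world"},
    {"type": "completed"},
):
    buffer.handle(ev)
```

After the loop, `buffer.lines` holds the single finished caption and `buffer.current` is empty, ready for the next utterance. Showing `current` on screen as it grows is what lets captions appear while the speaker is still talking.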
The Realtime API incorporates multiple layers of safeguards and mitigations to help prevent misuse. We employ active classifiers over Realtime API sessions, meaning certain conversations can be halted if they are detected as violating our harmful content guidelines. Developers can also add their own safety guardrails using the Agents SDK.
Our usage policies prohibit repurposing or distributing outputs from our services for spam, deception, or other harmful purposes. Developers must also make it clear to end users when they’re interacting with AI, unless it’s already obvious from the context.
The Realtime API fully supports EU Data Residency for EU-based applications and is covered by our enterprise privacy commitments.
GPT‑Realtime‑2, GPT‑Realtime‑Translate and GPT‑Realtime‑Whisper are available in the Realtime API. GPT‑Realtime‑2 is priced at $32 / 1M audio input tokens ($0.40 for cached input tokens) and $64 / 1M audio output tokens. GPT‑Realtime‑Translate is priced at $0.034 per minute. GPT‑Realtime‑Whisper is priced at $0.017 per minute.
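The listed prices translate into a simple cost estimate. The sketch below is illustrative arithmetic only: it assumes the cached input price of $0.40 is also quoted per 1M tokens, that cached and uncached input tokens are billed separately, and it ignores any other charges not mentioned above. The function and constant names are made up for this example.

```python
# Prices as listed in the announcement (US dollars).
AUDIO_INPUT_PER_M = 32.00    # per 1M audio input tokens
CACHED_INPUT_PER_M = 0.40    # assumed to be per 1M cached input tokens
AUDIO_OUTPUT_PER_M = 64.00   # per 1M audio output tokens
TRANSLATE_PER_MIN = 0.034    # GPT-Realtime-Translate, per minute
WHISPER_PER_MIN = 0.017      # GPT-Realtime-Whisper, per minute


def realtime2_cost(
    input_tokens: int, cached_input_tokens: int, output_tokens: int
) -> float:
    """Estimate GPT-Realtime-2 audio cost for one session.

    input_tokens counts uncached audio input tokens only.
    """
    total = (
        input_tokens * AUDIO_INPUT_PER_M
        + cached_input_tokens * CACHED_INPUT_PER_M
        + output_tokens * AUDIO_OUTPUT_PER_M
    )
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimate cost for the per-minute models."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    print(f"{realtime2_cost(1_000_000, 0, 500_000):.2f}")
    print(f"{per_minute_cost(60, TRANSLATE_PER_MIN):.2f}")
    print(f"{per_minute_cost(60, WHISPER_PER_MIN):.2f}")
```

Under these assumptions, one hour of audio costs about $2.04 with GPT‑Realtime‑Translate (60 × $0.034) and about $1.02 with GPT‑Realtime‑Whisper (60 × $0.017).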
You can test the new realtime voice models in the Playground.
To start building, open this prompt in Codex to add GPT‑Realtime‑2 to an existing app or start a new one. If you don’t have Codex yet, download the Codex app first.

Facts Only

Three new audio models are introduced: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper.
GPT-Realtime-2 features GPT-5-class reasoning for natural voice interactions.
GPT-Realtime-Translate supports live translation from 70+ input languages to 13 output languages.
GPT-Realtime-Whisper provides real-time speech-to-text transcription.
Zillow is building a real estate voice assistant with GPT-Realtime-2.
Deutsche Telekom is testing GPT-Realtime-Translate for multilingual customer support.
Priceline is exploring voice-based trip management with these models.
GPT-Realtime-2 includes features like preambles, parallel tool calls, and adjustable reasoning effort.
The models are available in the Realtime API with specific pricing structures.
Safeguards include active classifiers to prevent misuse and support for EU data residency.
Developers can test the models in the Playground and start building with GPT-Realtime-2 through a prompt in the Codex app.

Executive Summary

Three new audio models (GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper) have been introduced to enhance voice-based applications. GPT-Realtime-2 offers advanced reasoning and tool integration for natural, real-time conversations, while GPT-Realtime-Translate enables live multilingual translation across 70+ input and 13 output languages. GPT-Realtime-Whisper provides low-latency speech-to-text transcription for live interactions. These models aim to improve voice-to-action, systems-to-voice, and voice-to-voice applications, with companies like Zillow, Deutsche Telekom, and Priceline already building or testing with them for customer support, travel management, and multilingual communication. The models include safeguards against misuse and support EU data residency. GPT-Realtime-2 costs $32 per million audio input tokens and $64 per million audio output tokens, GPT-Realtime-Translate costs $0.034 per minute, and GPT-Realtime-Whisper costs $0.017 per minute. Developers can test these models in the Playground.

Full Take

The introduction of these real-time audio models marks a significant shift toward more natural and functional voice interfaces, but it also raises questions about the broader implications of AI-driven communication. The models' capabilities—such as live translation, contextual reasoning, and low-latency transcription—could democratize access to information and services, particularly for multilingual or mobility-impaired users. However, the emphasis on "natural" conversation and seamless integration into workflows may obscure the limitations of AI, such as potential biases in translation or the risk of over-reliance on automated systems in high-stakes scenarios (e.g., healthcare or legal contexts).
The pricing structure and enterprise-focused features suggest these tools are primarily aimed at businesses, which could exacerbate inequalities if smaller developers or nonprofits cannot afford access. Additionally, while safeguards like active classifiers and data residency compliance are noted, the long-term ethical and privacy implications of widespread voice AI adoption remain underexplored. For instance, how might real-time translation affect cultural nuances in communication, or could voice-to-action systems inadvertently reinforce existing power dynamics in customer-service interactions?
Bridge questions: How might these models be misused in surveillance or manipulation contexts, despite safeguards? What cultural or linguistic biases might persist in translation models, even with broad language support? Would the benefits of these tools outweigh the risks of reducing human-to-human interaction in critical domains?

Sentinel — Human

Confidence

The text is highly structured and technically precise, exhibiting characteristics of professional marketing material, but lacks the typical stylistic idiosyncrasies of casual human writing, suggesting either expert human authorship or sophisticated AI generation based on detailed source material.

Signals Detected

Moderately uniform sentence structure and high informational density, characteristic of professional technical writing, but with varied rhetorical pacing.

High coherence; the text flows logically from introduction to technical features and use cases without apparent emotional digressions or balancing of conflicting viewpoints.

Structured presentation of features, benchmarks, and use cases follows a clear, template-driven press release format. Statistics are presented specifically, suggesting internal data or highly curated sourcing.

Specific, quantitative results (e.g., 15.2% lift, 26-point lift, specific pricing) are present. While plausible, the density and precision are consistent with either internal disclosure or advanced LLM synthesis based on provided data.

Human Indicators

The integration of highly specific, quantitative performance metrics and pricing structures suggests grounding in actual product data, which often requires human oversight or direct input of proprietary information.

The specific use cases (e.g., Zillow, Deutsche Telekom) and the detailed explanation of agentic workflows show a depth of domain knowledge consistent with expert input.