Big news! OpenAI launches a full lineup of voice-agent models: you can now teach AI how to speak with unprecedented precision

OpenAI has launched a new family of audio models for voice agents, making it possible to finely control how AI speaks.
Core content:
1. OpenAI has released three advanced audio models to help build powerful voice agents
2. The new speech-to-text and text-to-speech models have excellent performance, bringing a more natural voice interaction experience
3. The upgraded Agent SDK makes it easier to build voice agents and improves development efficiency
Just now, OpenAI released a series of new models and tools. Specifically, it launched three new advanced audio models in the API:
• Two speech-to-text models - outperforming Whisper
• A new TTS (text-to-speech) model - you can teach AI how to speak
They all serve one core goal: letting developers easily build powerful "voice agents"!
In the livestream, Olivier Godement, head of OpenAI's platform, said that the team has been actively building AI agents and is now expanding its focus from text to voice.
Why voice? Olivier believes voice is the most natural way for humans to interact. Compared with reading and writing, speaking is more convenient and more human. A reliable, accurate, and flexible voice agent will therefore greatly expand the range of scenarios where AI can be applied.
Let me highlight the key points for you first
Three major models work together to build the cornerstone of "voice-controlled AI"
To achieve this vision, OpenAI is relying on three key building blocks:
1. Two new speech-to-text models: GPT-4o-transcribe and GPT-4o-mini-transcribe
Billed as the strongest available, these two models comprehensively outperform the previous Whisper models, with a qualitative leap in transcription accuracy across languages. In other words, AI can now hear you more clearly and more accurately!
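As a rough illustration (not from the announcement itself), here is a minimal sketch of calling the new transcription models through the standard OpenAI Python SDK. The assumption is that the new model names are accepted by the same audio.transcriptions endpoint Whisper used; the file name is a placeholder.

```python
# Minimal sketch: transcribing an audio file with the new speech-to-text models.
# Assumption: "gpt-4o-transcribe" / "gpt-4o-mini-transcribe" are drop-in model
# names for the existing audio.transcriptions endpoint.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # or "gpt-4o-transcribe" for top accuracy
        file=audio_file,
    )

print(transcript.text)
```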
2. New "text-to-speech" model: GPT-4o-mini-tts
This model allows developers to finely control the way AI speaks for the first time. Not only can they decide what AI says, but they can also control how AI says it! You can control the tone and emotion to create a more human voice experience.
To make the model easier to try, OpenAI built a new site for it, http://OpenAI.fm, an interactive demo where developers can play with the new text-to-speech model in the OpenAI API. OpenAI has pre-generated a variety of demo texts, and you can pick different voices and emotional styles to read them, or enter your own text and experiment with how it is delivered.
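Outside the demo site, the same control is available through the API. Here is a minimal sketch using the OpenAI Python SDK's audio.speech endpoint; the `instructions` field for steering tone and emotion, and the voice name, are assumptions based on how the new model is described, so check the current API reference.

```python
# Minimal sketch: "teaching the AI how to speak" with gpt-4o-mini-tts.
# Assumption: the audio.speech endpoint accepts the new model name and an
# `instructions` parameter describing tone/emotion -- verify against the docs.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # voice name is a placeholder
    input="Your order is on its way and should arrive within two days.",
    instructions="Speak in a warm, cheerful, reassuring customer-service tone.",
) as response:
    response.stream_to_file("reply.mp3")
```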
3. Upgraded Agent SDK
To make it easier for developers to build voice agents, OpenAI has made a major update to the previously released Agent SDK, making it possible to "upgrade" text agents to voice agents with one click! The upgrade has several highlights (a brief usage sketch follows this list):
Voice capability support: the Agent SDK deeply integrates OpenAI's latest speech-to-text and text-to-speech models, so developers can give an agent "ears" and a "mouth" without complicated configuration.
Streaming optimization: the upgraded SDK supports bidirectional streaming, making both audio input and speech output closer to real time and greatly improving the fluency of voice interaction.
Ready to use out of the box: the Agent SDK provides rich sample code and detailed documentation, so even novice developers can get started quickly and easily convert text agents into voice agents.
Debugging tools: the Agent SDK integrates seamlessly with OpenAI's debugging UI. Developers can visually trace the entire voice interaction, from audio input and text transcription through model reasoning to speech synthesis, which greatly improves debugging efficiency!
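For orientation only, here is a minimal sketch of what "upgrading" a text agent to a voice agent looks like with the Agent SDK's voice pipeline. The package extra and the class names (VoicePipeline, SingleAgentVoiceWorkflow, AudioInput) are taken from my reading of the SDK documentation and may differ in the current release, so treat them as assumptions to verify.

```python
# Rough sketch, assuming `pip install "openai-agents[voice]"` and that the SDK's
# voice pipeline API looks roughly like its documented quickstart.
import asyncio
import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# An ordinary text agent...
agent = Agent(
    name="Assistant",
    instructions="You are a friendly assistant. Keep answers short.",
)

# ...wrapped in a voice pipeline that handles speech-to-text and text-to-speech.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

async def main() -> None:
    # Three seconds of silence as placeholder input; a real app would pass
    # microphone audio here.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # Stream synthesized audio chunks back as they are produced.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            print(f"got an audio chunk of {len(event.data)} samples")

asyncio.run(main())
```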
Two mainstream approaches to building voice agents
In the livestream, OpenAI's Jeff Harris shared the two main approaches to building voice agents:
Method 1: Connect directly to a speech-to-speech model via the Realtime API
This approach is more cutting-edge: it uses a speech-to-speech model directly, so the AI understands audio and produces speech without an intermediate text step, which is faster and more fluid. It is also the technology behind ChatGPT's Advanced Voice Mode.
Method 2: Chain audio models with a text model
This is a more user-friendly and reliable solution, and is also the method that OpenAI strongly recommends this time. It is achieved through the following steps:
1. Speech-to-text model: converts the user's speech into text.
2. Text-based LLM (such as GPT-4o): understands the text and generates an appropriate response.
3. Text-to-speech model: converts the text response into natural, fluent speech.
(A minimal code sketch of this chain appears after the list of advantages below.)
Jeff emphasized that the advantages of the chain solution are:
• Modularity: the model at each step can be swapped out flexibly, so you can choose the most suitable components.
• High reliability: the intelligence of text models is still the current "gold standard", so the chained approach offers higher reliability.
• Ease of use: developers can quickly add voice capabilities on top of an existing text-agent project.
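To make the chain concrete, here is a minimal sketch using the standard OpenAI Python SDK. The three-step wiring follows the description above; file names, prompts, and the chosen voice are placeholders, not part of the announcement.

```python
# Minimal sketch of the chained approach: speech -> text -> LLM -> speech.
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: transcribe the user's audio.
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
    )

# 2. Text-based LLM: reason over the transcript and draft a reply.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise, friendly voice assistant."},
        {"role": "user", "content": transcript.text},
    ],
)
reply_text = chat.choices[0].message.content

# 3. Text-to-speech: speak the reply back to the user.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=reply_text,
) as speech:
    speech.stream_to_file("assistant_reply.mp3")
```

Because each step is a separate call, any link can be swapped out independently, which is exactly the modularity advantage Jeff highlighted.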
The technology behind the models
Pre-training using real audio datasets
The new audio models are based on the GPT‑4o and GPT‑4o-mini architectures and are extensively pre-trained on specialized audio-centric datasets, which is critical to optimizing model performance. This targeted approach provides deeper insights into speech nuances and achieves outstanding performance on audio-related tasks.
Advanced distillation methods
Enhanced distillation techniques enable knowledge transfer from the largest audio models to smaller, more efficient ones. Using advanced self-play methods, the distilled datasets effectively capture realistic conversational dynamics, replicating real user-assistant interactions. This helps the smaller models deliver excellent conversational quality and responsiveness.
Reinforcement Learning Paradigm
For the speech-to-text models, OpenAI integrated a reinforcement-learning-heavy paradigm to push transcription accuracy to the state of the art. The approach significantly improves accuracy and reduces hallucinations, making the new speech-to-text models highly competitive in complex speech recognition scenarios.
Outstanding performance at an affordable price
The GPT-4o series of speech-to-text models delivers impressive numbers: on the FLEURS benchmark, their error rate is far lower than that of the previous-generation Whisper models, a genuine step up.
What’s even more surprising is that the price is also very reasonable:
• GPT-4o-transcribe: 0.6 cents per minute, in line with the Whisper model
• GPT-4o-mini-transcribe: only 0.3 cents per minute, even more cost-effective!
• GPT-4o-mini-tts: 1 cent per minute of generated speech, very affordable
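As a quick back-of-the-envelope check using the per-minute prices quoted above (actual billing is metered by the API, so this is only an approximation), a hypothetical workload of 1,000 minutes per day works out as follows:

```python
# Rough monthly cost estimate at the quoted per-minute rates.
# Assumption: cost is approximated as a flat per-minute price.
MINUTES_PER_DAY = 1_000  # hypothetical call volume
DAYS = 30

rates_usd_per_min = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-mini-tts": 0.01,
}

for model, rate in rates_usd_per_min.items():
    monthly = MINUTES_PER_DAY * DAYS * rate
    print(f"{model}: ${monthly:,.0f} per month")
# gpt-4o-transcribe: $180, gpt-4o-mini-transcribe: $90, gpt-4o-mini-tts: $300
```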