It's been about a week since OpenAI started rolling out Advanced Voice Mode to Plus users on September 27, 2024, so I'd like to share some thoughts.
For context, voice chat in ChatGPT previously meant what is now called Standard Voice, which works by converting speech to text with Whisper, processing the text with GPT-4o or GPT-4o mini, and then converting the reply back to speech with TTS. This pipeline works quite well, but it can't perceive tone of voice or the other cues that shape meaning in real conversations.
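For anyone curious what that pipeline looks like in code, here's a minimal sketch using the OpenAI Python SDK. The file names and model choices are just placeholders for illustration, not what ChatGPT itself uses:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Speech -> text: transcribe the user's audio with Whisper
with open("question.mp3", "rb") as audio_in:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_in,
    )

# 2) Text -> text: answer the transcribed question with a chat model
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3) Text -> speech: read the answer aloud with TTS
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as audio_out:
    audio_out.write(speech.content)
```

Because the model only ever sees the plain transcript in step 2, any tone or emotion in the original audio is lost along the way, and that's exactly the gap Advanced Voice is meant to close.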
This new Advanced Voice uses GPT-4o's ability to directly receive and generate speech, making it true Speech-to-Speech. This allows it to perceive tone and respond more appropriately to situations, including adjusting its voice to flow with the context.
From experimenting with it and putting it through its paces in Thai, I found that:
🌟It performs much better than before, with faster response times, making conversations more natural.
🌟The accent is still slightly noticeable, but much less so.
🌟We can interrupt it at any time, without waiting for it to finish speaking.
🌟The voice has more variety, able to narrate with pauses, use emphatic tones to create surprise, or speak slowly as if casting a spell. It's perfect for bedtime stories.
🌟Paused conversations can be resumed with all context remembered.
However, despite all the good points, we also ran into some limitations:
🔹 Unable to retrieve information from the internet.
🔹 The Advanced version can respond for about 20 seconds at a time, then stops speaking. We can ask it to continue from that point. The Standard version can answer for longer periods.
🔹 The Advanced version cannot continue conversations from the Standard version or from previous text chats.
🔹 The transcription may not always match the actual speech; sometimes Thai speech is transcribed as English.
If anyone wants to hear the comparison between Standard and Advanced, check the attached clips. We tried to recreate the same scenario with both and edited the clips together. Don't forget to turn on the sound 🎧
⚙️ If I remember correctly, the voice used is Ember. ⚙️ In the clips, TTS is the Standard version and GPT-4o Voice is the Advanced version.
Voice samples:
Rela's Adventure
Why are apples red?
If you've finished reading and want to try it out, but don't want your voice to be used for training, remember to turn off "Improve voice for everyone" in the ChatGPT app.
Finally, from a developer's perspective, Advanced Voice might seem like just a toy for casual conversation or a language-learning tool, but it has much more potential. Right now its capabilities are limited to what we see, but if we could integrate it with other systems or give it more knowledge, we might see a world where Advanced Voice becomes a real assistant that we don't have to command directly: we just casually say what we want to do and let the AI interpret it.
It could also help at work, for example by reducing the burden on call centers: staff wouldn't have to talk directly to customers and could focus on problem-solving instead, and customers wouldn't have to navigate complicated phone menus. Or AI could screen callers' emotions and summarize their issues so that staff make fewer mistakes. There are many more interesting use cases, but I'll stop here for now.
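To make that screening idea a little more concrete, here's a toy sketch using the regular chat API (not Advanced Voice itself). The prompt, field names, and priorities are all made up for illustration; it simply takes a call transcript and returns the caller's mood plus a short summary for the agent:

```python
import json
from openai import OpenAI

client = OpenAI()

def screen_call(transcript: str) -> dict:
    """Classify the caller's mood and summarize the issue for a human agent."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You triage call-center conversations. "
                    "Return JSON with keys: mood (calm/frustrated/angry), "
                    "issue_summary (one sentence), suggested_priority (low/medium/high)."
                ),
            },
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(screen_call("My internet has been down for three days and nobody called me back!"))
```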
And just before we were about to post this...
We saw the announcement of Introducing the Realtime API 👀
Which was just launched on October 1st.
https://openai.com/index/introducing-the-realtime-api
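From a quick skim of the announcement, it's a WebSocket API that exchanges JSON events, so connecting probably looks something like the sketch below. This is a rough guess based on the launch post; I haven't been able to run it yet, so treat the model name and event types as assumptions:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

# Model name and headers as shown in the launch announcement; these may change.
URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # Older websockets versions take extra_headers=; newer ones use additional_headers=.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask for a response; audio replies would arrive as base64 chunks in
        # "response.audio.delta" events, which a real client would play back.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["text"],
                "instructions": "Say hello in Thai.",
            },
        }))
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))
            if event.get("type") in ("response.done", "error"):
                break

asyncio.run(main())
```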
That API is the channel that will let us build on Advanced Voice in our own way. But it seems access is still limited at the moment, perhaps because I'm a low-tier customer 😢 If anyone has gotten access already, feel free to share. Once I get access, I'll update you in the next post. See you then 👋