OpenAI Unveils GPT-Realtime Speech-To-Speech Model With Multimodal Support And Advanced Conversational Capabilities


In Brief
OpenAI released the gpt-realtime speech-to-speech model with multimodal support, advanced conversational skills, and strong audio reasoning performance.

Artificial intelligence research organisation OpenAI announced the general availability of its Realtime API, now enhanced with features that allow developers and enterprises to build robust, production-ready voice agents. The API supports remote MCP servers, image inputs, and phone calling via Session Initiation Protocol (SIP), enabling more capable and context-aware voice applications.
Alongside the API, OpenAI has released its most advanced speech-to-speech model, gpt-realtime, designed to improve instruction following, function calling, and natural-sounding speech. The model can interpret complex prompts, switch languages mid-sentence, reproduce alphanumeric sequences accurately, and capture non-verbal cues. Two new voices, Cedar and Marin, are also available, offering more expressive and human-like intonation. Existing voices have been updated to incorporate these enhancements.
The Realtime API processes audio directly through a single model, reducing latency and preserving nuance, unlike traditional pipelines that chain separate speech-to-text and text-to-speech models. gpt-realtime has been trained in collaboration with users to excel in real-world applications such as customer support, personal assistance, and education. Benchmark evaluations show substantial improvements in reasoning, instruction adherence, and function calling accuracy compared to previous models.
Additional updates include asynchronous function calling, allowing long-running operations without interrupting ongoing conversations, further supporting seamless, production-ready voice experiences.
OpenAI Expands Realtime API With MCP Support, Image Inputs, SIP Integration, And Cost-Saving Controls For Voice Agents
OpenAI’s Realtime API now includes new features designed to simplify integration and expand capabilities for production-ready voice agents. Developers can enable remote MCP support by linking a session to an MCP server URL, allowing the API to manage tool calls automatically and access additional functionalities without manual setup.
The gpt-realtime model now supports image inputs, enabling the system to incorporate photos, screenshots, and other visuals alongside audio or text. This allows users to ask context-specific questions about what they see, while developers retain control over which images are shared and when.
Additional improvements include Session Initiation Protocol (SIP) support for connecting apps to phone networks and PBX systems, as well as reusable prompts that let developers save and deploy pre-configured instructions, tools, and example messages across multiple sessions.
The generally available Realtime API and gpt-realtime model are now accessible to all developers, with pricing reduced by 20% compared to the previous gpt-4o-realtime-preview. New controls for conversation context allow for smarter token management, reducing costs for long-running sessions. Documentation, a Playground for testing, and a Realtime API prompting guide are available to support developers in adopting these features.
Disclaimer
In line with the Trust Project guidelines, please note that the information provided on this page is not intended to be and should not be interpreted as legal, tax, investment, financial, or any other form of advice. It is important to only invest what you can afford to lose and to seek independent financial advice if you have any doubts. For further information, we suggest referring to the terms and conditions as well as the help and support pages provided by the issuer or advertiser. MetaversePost is committed to accurate, unbiased reporting, but market conditions are subject to change without notice.
About The Author
Alisa, a dedicated journalist at the MPost, specializes in cryptocurrency, zero-knowledge proofs, investments, and the expansive realm of Web3. With a keen eye for emerging trends and technologies, she delivers comprehensive coverage to inform and engage readers in the ever-evolving landscape of digital finance.
More articles

Alisa, a dedicated journalist at the MPost, specializes in cryptocurrency, zero-knowledge proofs, investments, and the expansive realm of Web3. With a keen eye for emerging trends and technologies, she delivers comprehensive coverage to inform and engage readers in the ever-evolving landscape of digital finance.