SoundStorm: Google Unveils Terrifying AI Tool Capable of Real-Time Voice Replication

by Damir Yalalov

Published: May 30, 2023 at 10:00 am Updated: May 30, 2023 at 7:26 am

by Karolina Gaszcz

Edited and fact-checked: May 30, 2023 at 10:00 am

In Brief

Google has introduced SoundStorm, a cutting-edge model for efficient and non-autoregressive audio generation.

It employs bidirectional attention and confidence-based parallel decoding to generate high-quality audio while significantly reducing generation time.

It also has the ability to synthesize natural dialogues.

Google has introduced SoundStorm, a model for efficient, non-autoregressive audio generation. Because it can synthesize dialogues with different voices, SoundStorm opens up applications such as generating audio content from written text and creating realistic podcasts.

Unlike its predecessor AudioLM, which decodes audio tokens one after another, SoundStorm fills in tokens in parallel and generates audio in chunks of 30 seconds. By combining bidirectional attention with confidence-based parallel decoding, the model produces high-quality audio while significantly reducing generation time. On Google’s TPU-v4 hardware, SoundStorm can generate 30 seconds of audio in just 0.5 seconds.
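To make the idea of confidence-based parallel decoding more concrete, here is a minimal toy sketch in Python. It is not Google's implementation: the `toy_model` function is a hypothetical stand-in that returns random proposals, and the cosine schedule is an assumption borrowed from masked-token decoders in general. The sketch only shows the control flow, in which every masked position gets a proposal at once and only the most confident proposals are committed at each step.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_model(tokens, rng, vocab_size=8):
    """Hypothetical stand-in for the real network (an assumption).

    It ignores the context and returns a random (token, confidence) pair
    for every position. A real model would attend bidirectionally over
    all tokens that are already known.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(length, steps, seed=0):
    """Fill `length` masked positions in at most `steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    commits = []  # how many tokens were committed at each step
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        proposals = toy_model(tokens, rng)
        # Cosine schedule (assumed): positions left masked after this step.
        remain = math.floor(length * math.cos(math.pi / 2 * step / steps))
        if step == steps:
            remain = 0  # the last pass must fill everything
        n_commit = max(1, len(masked) - remain)
        # Commit the most confident proposals; the rest stay masked.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = proposals[i][0]
        commits.append(n_commit)
    return tokens, commits


tokens, commits = parallel_decode(length=16, steps=4)
print(commits)  # [2, 3, 5, 6]
```

With 16 positions and 4 passes, the schedule commits few tokens early and many late, so the whole sequence is filled in 4 passes instead of 16 sequential steps. That gap between pass count and sequence length is where the speedup of this family of decoders comes from.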

SoundStorm was trained on a massive dataset of 100,000 hours of dialogue, giving it broad exposure to spoken language patterns. The model achieves impressive consistency in voice and acoustic conditions while matching the audio quality of AudioLM. It is also two orders of magnitude faster than its predecessor, which points to its potential for scalable audio generation.

One of the key capabilities of SoundStorm is its ability to synthesize natural dialogues by leveraging the text-to-semantic modeling stage of SPEAR-TTS. By providing transcripts with speaker turns and short voice prompts, users can control the spoken content and the voices of the speakers. During testing, SoundStorm demonstrated the ability to synthesize 30-second dialogue segments in just 2 seconds on a single TPU-v4, showcasing its efficiency and versatility.
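The two timings quoted in this article can be turned into a real-time factor, meaning seconds of audio produced per second of compute. The short Python sketch below does only that arithmetic; the function name is our own, and the inputs are the figures reported above rather than new measurements.

```python
def real_time_factor(audio_seconds, wall_clock_seconds):
    """Return seconds of audio produced per second of compute."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


# Figures as reported in the article, both on TPU-v4 hardware.
audio_only = real_time_factor(30, 0.5)  # audio generation alone
dialogue = real_time_factor(30, 2)      # dialogue synthesis from a transcript

print(audio_only)  # 60.0
print(dialogue)    # 15.0
```

On these numbers, audio generation alone runs about 60 times faster than real time, and dialogue synthesis about 15 times faster than real time.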

Compared with standard baselines, the audio generated by SoundStorm matches the quality of AudioLM and shows better consistency in voice and acoustic conditions. Notably, when prompted with a speech sample, the model preserves the speaker’s voice with high accuracy, which greatly improves its capacity to generate lifelike dialogue.

While SoundStorm’s capabilities are outstanding, it is critical to recognize and address the ethical concerns they raise. The model’s training data may introduce biases relating to accents and voice characteristics. The capacity to imitate voices could be abused for impersonation or to circumvent biometric identification. Google underlines the importance of putting safeguards in place to prevent such abuse and of ensuring that generated audio remains detectable through dedicated classifiers.

Google’s ethical AI principles guide its continuing efforts to address potential risks and limitations. The company recognizes the need for a thorough analysis of the training data and its effect on model outputs. It also plans to investigate additional approaches for detecting synthesized speech, such as audio watermarking, to support the ethical use of this technology.

SoundStorm is a big step forward in AI-powered audio production, efficiently generating high-quality audio from the representations of a neural audio codec. Google expects that SoundStorm’s lower memory and processing needs will make audio generation research more accessible to a wider community. As the technology evolves, Google says it remains committed to responsible AI practices and to the safe use of SoundStorm and comparable breakthroughs in the field.
