Since the release of the first text-to-speech (TTS) model, researchers have been looking for ways to improve the way these systems generate speech. The latest model from Microsoft, VALL-E, is a significant step forward in this regard.

VALL-E is a transformer-based TTS model that can generate speech in a speaker's voice after hearing only a three-second sample of it. This is a significant improvement over previous models, which typically required much more audio and additional training before they could reproduce a new voice.

VALL-E is an amazing technological feat that has the potential to change the way we interact with digital media.

Additionally, the intonation, emotion, and style of the sampled voice are all preserved in the generated speech. This is an important step forward in making TTS systems sound more natural.

The model is transformer-based and resembles DALL-E 1 in design, not the diffusion-based DALL-E 2. The code has not been released yet, and some users are skeptical that Microsoft will publish it. However, Microsoft has released a few examples of the model in action, and they suggest that this is a major advance in TTS technology.

Example #1:

Example #2:

Example #3:
