For those interested in AI technology developments, check out this story:
Google’s new AI voice synthesizer, a service named Cloud Text-to-Speech, will be available to businesses and developers who want to integrate voice synthesis into apps, websites, or virtual assistants.
Traditionally, voice synthesizers use what is called concatenative synthesis, in which the program pieces individual recorded syllables together to form words and sentences. While the resulting speech is understandable, it lacks the imperfections of human speech that make it sound realistic, despite years of refinement.
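To make the idea concrete, here is a minimal toy sketch of concatenative synthesis. The unit database, syllable names, and placeholder sine-wave "recordings" are all made up for illustration; real systems store thousands of recorded speech units, not generated tones.

```python
import numpy as np

SAMPLE_RATE = 24_000  # samples per second

def tone(freq_hz, duration_s):
    """Stand-in for a recorded speech unit (real systems use actual recordings)."""
    t = np.linspace(0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    return 0.5 * np.sin(2 * np.pi * freq_hz * t)

# Hypothetical unit database: syllable -> waveform
unit_db = {
    "hel": tone(220, 0.15),
    "lo": tone(260, 0.20),
}

def concatenative_synthesize(syllables):
    """Piece individual pre-recorded units together to form an utterance."""
    return np.concatenate([unit_db[s] for s in syllables])

audio = concatenative_synthesize(["hel", "lo"])
print(len(audio))  # 0.15 s + 0.20 s at 24 kHz = 8400 samples
```

The seams between units are exactly where the characteristic "robotic" sound comes from: each unit was recorded in isolation, so pitch and timing don't flow naturally across the joins.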
In comparison, Google’s new AI voice synthesizer is powered by WaveNet, which uses machine learning to generate audio from scratch. It learns from a huge database of human speech and generates waveforms one step at a time, at a rate of 24,000 samples per second. The end result is a voice with subtleties like lip smacks and accents!
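The "one sample at a time" part is the key difference: WaveNet is autoregressive, predicting each audio sample from the samples before it. The sketch below is only a toy illustration of that generation loop; the fixed linear "predictor" stands in for WaveNet's actual deep convolutional network, which this is not.

```python
import numpy as np

SAMPLE_RATE = 24_000  # WaveNet's reported output rate: 24,000 samples per second

def generate_autoregressive(n_samples, receptive_field=16, seed=0):
    """Toy autoregressive generation: each new sample depends on the previous ones."""
    rng = np.random.default_rng(seed)
    # Stand-in for a trained model: a fixed linear predictor over recent samples.
    weights = rng.normal(scale=0.1, size=receptive_field)
    audio = np.zeros(n_samples)
    for i in range(receptive_field, n_samples):
        context = audio[i - receptive_field:i]           # look back at recent output
        audio[i] = np.tanh(context @ weights) + rng.normal(scale=0.01)
    return audio

clip = generate_autoregressive(SAMPLE_RATE // 100)  # 10 ms of audio = 240 samples
print(len(clip))
```

Because every sample is conditioned on what came before, the output can carry natural continuity (breaths, pitch drift, even lip smacks) instead of the hard seams you get from gluing pre-recorded units together.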
Check out the full story here (includes audio samples of WaveNet vs. a traditional audio synthesizer):
https://www.theverge.com/2018/3/27/17167200/google-ai-speech-tts-cloud-deepmind-wavenet

I don't think I hear much of a difference in the English version, though; maybe something is wrong with my hearing. Does anyone else hear a big difference? The Japanese sample definitely shows a big difference. I'm kind of excited about this because I've been wondering when that very robotic sound in synthesizers would finally be made a bit more natural.