Amazon’s neural TTS can model speaking styles with only a few hours of recordings

Nitin Naresh November 19, 2018

Tired of Alexa’s staid, monotonous tone? Well, thanks to a new artificial intelligence (AI) technique, Amazon might soon be able to roll out new speaking styles to its voice assistant at a rapid clip.
In a newly published paper (“Effect of data reduction on sequence-to-sequence neural TTS”) and an accompanying blog post, the Seattle company today detailed a text-to-speech (TTS) system that can learn to adopt a new speaking style, such as that of a newscaster, from just a few hours of training data. Traditional methods require hiring a voice actor to read in the target style for tens of hours.
“To users, synthetic speech produced by neural networks sounds much more natural than speech produced through concatenative methods, which string together short speech snippets stored in an audio database,” wrote Trevor Wood, applied science manager at Amazon. “With the increased flexibility provided by [our system], we can easily vary the speaking style of synthesized speech.”
Amazon’s AI model, which it refers to as neural TTS (NTTS for short), consists of two components. The first is a generative neural network that converts a sequence of phonemes (perceptually distinct units of sound that distinguish one word from another, such as the p, b, d, and t in pad, pat, bad, and bat) into a sequence of spectrograms, visual representations of the spectrum of frequencies of a sound as they vary with time. The second is a vocoder that converts those spectrograms into a continuous audio signal. Specifically, they are mel-spectrograms, which have frequency bands that, according to Wood, “emphasize features that the human brain uses when processing speech.”
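To illustrate how mel-spectrogram frequency bands are spaced, here is a minimal sketch. It assumes the widely used conversion mel = 2595 · log10(1 + f/700); the article does not say which mel variant Amazon uses, and the function names and the 0 to 8,000 Hz range are chosen purely for illustration.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel with the common 2595*log10(1 + f/700) formula (an assumption)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> List[float]:
    """Return n_bands + 2 band edges in Hz, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Illustrative range and band count, not taken from the paper.
edges = mel_band_edges(0.0, 8000.0, 8)
# Distance in Hz between neighboring edges.
widths = [b - a for a, b in zip(edges, edges[1:])]
```

The entries of `widths` grow as frequency rises, so low frequencies, where much of the information in speech sits, get finer resolution than high ones.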
The phoneme-to-spectrogram interpreter network is sequence-to-sequence, Wood noted, meaning it doesn’t compute each output solely from the corresponding input but also considers that output’s position in the sequence of outputs. Scientists at Amazon trained it on phoneme sequences and the corresponding sequences of mel-spectrograms, along with a “style encoding” that identified the specific speaking style used in each training example.
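One way to picture the style encoding is as a small vector attached to every phoneme in the input. The sketch below assumes a one-hot style code and a toy phoneme inventory; the article does not describe how Amazon represents either, and every name here is hypothetical.

```python
from typing import Dict, List

# Hypothetical inventories, chosen only for illustration.
STYLES: List[str] = ["neutral", "newscaster"]
PHONEME_IDS: Dict[str, int] = {"p": 0, "ae": 1, "t": 2, "d": 3}


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_input(phonemes: List[str], style: str) -> List[List[float]]:
    """Build one feature vector per phoneme: phoneme one-hot followed by style one-hot."""
    style_code = one_hot(STYLES.index(style), len(STYLES))
    return [one_hot(PHONEME_IDS[p], len(PHONEME_IDS)) + style_code for p in phonemes]


# The word "pat" encoded for two different speaking styles.
pat_neutral = encode_input(["p", "ae", "t"], "neutral")
pat_news = encode_input(["p", "ae", "t"], "newscaster")
```

The two encodings share the same phoneme features and differ only in the trailing style entries. Under this assumed representation, a network could reuse what it knows about pronunciation while varying the delivery by style.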

Above: The NTTS’ architecture.

Image Credit: Amazon

The model’s output was fed into a vocoder that generated high-quality speech waveforms. Uniquely, the vocoder can take mel-spectrograms from any speaker, regardless of whether that speaker was seen during training, and it doesn’t require a speaker encoding.
The result? A model-training method that combines a large amount of neutral-style speech data with only a few hours of supplementary data in the desired style, and an AI system capable of distinguishing elements of speech both independent of a speaking style and unique to a style.
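The data mix described above can be sketched as a single training corpus in which every utterance carries a style label. The hour counts and utterance lengths below are invented for illustration (the article says only "a large amount" and "a few hours"), and the `Utterance` type and helper are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Utterance(NamedTuple):
    text: str
    style: str      # label the model would receive as its style encoding
    seconds: float  # recording length


def hours_by_style(corpus: List[Utterance]) -> Dict[str, float]:
    """Total recorded hours per speaking style."""
    seconds: Dict[str, float] = {}
    for utt in corpus:
        seconds[utt.style] = seconds.get(utt.style, 0.0) + utt.seconds
    return {style: total / 3600.0 for style, total in seconds.items()}


# Invented figures: 20 hours of neutral speech, 3 hours of newscaster speech,
# each made of 6-second utterances.
neutral = [Utterance(f"neutral sentence {i}", "neutral", 6.0) for i in range(12000)]
newscaster = [Utterance(f"news sentence {i}", "newscaster", 6.0) for i in range(1800)]

corpus = neutral + newscaster
totals = hours_by_style(corpus)
style_share = totals["newscaster"] / sum(totals.values())
```

With these made-up numbers the target style makes up about 13 percent of the corpus; the remainder is neutral speech from which the model can learn style-independent aspects of pronunciation.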
“When presented with a speaking-style code during operation, the network predicts the prosodic pattern suitable for that style and applies it to a separately generated, style-agnostic representation,” Wood explained. “The high quality achieved with relatively little additional training data allows for rapid expansion of speaking styles.”

Above: The results of Amazon’s listener survey.

Image Credit: Amazon

According to Amazon’s research, listeners preferred voices generated with NTTS to those produced through concatenative synthesis.
“The preference for the neutral-style NTTS reflects the widely reported increase in general speech synthesis quality due to neural generative methods,” Wood wrote. “The further improvement for the NTTS newscaster voice reflects our system’s ability to capture a style relevant to the text.”
The new research follows the debut of Alexa’s whisper mode, which enables Alexa to respond to whispered speech by whispering back.
Source: VentureBeat