Google Duo's new machine learning model improves audio quality in calls

XDA

Google's new WaveNetEQ machine learning model improves audio quality in Duo

By Tushar Mehta

Published Apr 2, 2020

Google Duo uses Google's novel WaveNetEQ machine learning model to improve audio quality in calls by filling gaps and curing jitter.

Google has a history of abruptly killing messaging apps in favor of newer ones that are eventually killed too. Google Duo has so far been an exception, even though it launched alongside Allo, the now-defunct messaging service. Duo has continued to receive Google's attention and a steady stream of new features, such as 1080p support on 5G Samsung Galaxy S20 phones, (upcoming) live captions, doodles, and group calls with up to 12 participants. Now, Google is applying machine learning to reduce the major problem of jitter for a smoother, uninterrupted audio experience.

Video calling has become a vital means of official communication during the COVID-19 quarantine period, and jittery audio can cost you or your company financially. Google acknowledges that 99% of calls on Duo suffer from interruptions due to network delays. About a fifth of these calls lose 3% of their audio, while a tenth lose nearly 8%, and much of that could be significant information that you end up missing. This happens when packets of data are delayed or lost in transmission; the missing packets cause glitches in the audio and can render much of it incomprehensible.

Google's new WaveNetEQ machine learning model is built for a technique called "packet loss concealment" (PLC). WaveNetEQ is a generative model based on DeepMind’s WaveRNN that creates chunks of audio to plug the gaps with realistic fillers. The model has been trained on a large pool of speech data. Because Google Duo calls are end-to-end encrypted, the model has to run on the receiver's device, and Google claims that it is "fast enough to run on a phone, while still providing state-of-the-art audio quality."
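To make the idea of packet loss concealment concrete, here is a minimal Python sketch of the receiver side. It is not Google's implementation: the function names, the packet layout (a dictionary keyed by sequence number), and the `repeat_last_frame` filler are all assumptions made for illustration. The filler simply replays the previous frame at half volume, a crude classic fallback, at the point where WaveNetEQ would instead generate new speech.

```python
from typing import Callable, Dict, List

Frame = List[float]  # one packet's worth of audio samples


def repeat_last_frame(history: List[Frame], frame_len: int) -> Frame:
    """Placeholder concealment: replay the previous frame at half volume.

    This is a crude stand-in, NOT the WaveNetEQ model, which would
    generate new audio conditioned on the signal heard so far.
    """
    if not history:
        return [0.0] * frame_len  # nothing to go on yet: emit silence
    return [0.5 * sample for sample in history[-1]]


def conceal_losses(
    received: Dict[int, Frame],
    total_packets: int,
    frame_len: int,
    filler: Callable[[List[Frame], int], Frame] = repeat_last_frame,
) -> List[Frame]:
    """Rebuild the playout stream, filling every missing sequence number."""
    output: List[Frame] = []
    for seq in range(total_packets):
        frame = received.get(seq)
        if frame is None:
            # Packet was lost or arrived too late: synthesize a stand-in
            # from the audio that has already been played out.
            frame = filler(output, frame_len)
        output.append(frame)
    return output


# Hypothetical example: packet 2 of 4 never arrived.
received = {0: [1.0, 1.0], 1: [0.8, 0.8], 3: [0.2, 0.2]}
stream = conceal_losses(received, total_packets=4, frame_len=2)
```

Because the filler is passed in as a parameter, a generative model could be swapped in for `repeat_last_frame` without changing the playout loop.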

WaveRNN relies on a text-to-speech model and, besides being trained on "what to say," it has also been trained on "how to say" it. It analyzes the input with a strong phonetic understanding to predict sounds in the immediate future. Besides filling gaps, the model also produces surplus audio as a raw waveform to overlap the part that follows the gap. This surplus is blended into the actual audio with a bit of cross-fading, which results in a smoother transition.
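The cross-fading step can be illustrated with a short Python sketch. The article only says that "a bit of cross-fading" is used, so the linear ramp and the function name `crossfade` here are assumptions, not a description of what Duo actually ships.

```python
from typing import List


def crossfade(generated_tail: List[float], real_head: List[float]) -> List[float]:
    """Blend two equal-length segments with a linear cross-fade.

    The generated signal fades out while the real signal fades in, so
    the two weights at every sample sum to 1.
    """
    if len(generated_tail) != len(real_head):
        raise ValueError("segments must have the same length")
    n = len(generated_tail)
    blended: List[float] = []
    for i, (gen, real) in enumerate(zip(generated_tail, real_head)):
        w = (i + 1) / (n + 1)  # weight of the real audio, rising across the overlap
        blended.append((1.0 - w) * gen + w * real)
    return blended


# Hypothetical example: fade from a constant generated signal into silence.
overlap = crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
# overlap == [0.75, 0.5, 0.25]
```

Blending over a short overlap avoids the audible click that an abrupt switch from generated to real samples could produce.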

Google Duo's WaveNetEQ model has been trained on speech from 100 individuals in 48 languages so that it can learn the general characteristics of the human voice instead of those of just one language. The model is trained to mostly produce syllables and can fill gaps of up to 120 ms.
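As a rough illustration of what a 120 ms ceiling means in practice, the Python sketch below splits a gap into frames that would be synthesized and frames that would be left unfilled. The 20 ms frame duration and the name `plan_gap` are assumptions for the example, and the article does not say how Duo handles the part of a gap beyond 120 ms.

```python
import math
from typing import Tuple

MAX_FILL_MS = 120  # longest gap the model can fill, per the article


def plan_gap(gap_ms: int, frame_ms: int = 20) -> Tuple[int, int]:
    """Return (frames_to_generate, frames_left_unfilled) for one gap.

    frame_ms=20 is an assumed packet duration, not a figure from the article.
    """
    if gap_ms < 0 or frame_ms <= 0:
        raise ValueError("gap_ms must be >= 0 and frame_ms must be > 0")
    total_frames = math.ceil(gap_ms / frame_ms)   # frames missing in the gap
    max_frames = MAX_FILL_MS // frame_ms          # frames the cap allows
    generated = min(total_frames, max_frames)
    return generated, total_frames - generated


# Hypothetical examples:
short_gap = plan_gap(60)    # (3, 0): fully concealed
long_gap = plan_gap(200)    # (6, 4): only the first 120 ms is filled
```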

The feature is already available on the Google Pixel 4 and is now rolling out to other Android devices.

Source: Google AI Blog