US 11,875,809 B2
Speech denoising via discrete representation learning
Zhao Song, Sunnyvale, CA (US); and Wei Ping, Sunnyvale, CA (US)
Assigned to Baidu USA LLC, Sunnyvale, CA (US)
Filed by Baidu USA, LLC, Sunnyvale, CA (US)
Filed on Oct. 1, 2020, as Appl. No. 17/061,317.
Prior Publication US 2022/0108712 A1, Apr. 7, 2022
Int. Cl. G10L 21/0208 (2013.01); G10L 25/30 (2013.01); G06N 3/045 (2023.01); G06N 3/08 (2023.01)
CPC G10L 21/0208 (2013.01) [G06N 3/045 (2023.01); G10L 25/30 (2013.01)]
20 Claims
 
1. A computer-implemented method for training a denoising system comprising:
given a denoising system comprising a first encoder, a second encoder, a quantizer, and a decoder and given a set of one or more clean-noisy audio pairs, in which each clean-noisy audio pair comprises a clean audio of content by a speaker and a noisy audio of the content by the speaker:
for each clean audio from the set of one or more clean-noisy audio pairs:
generating one or more continuous latent representations for the clean audio using the first encoder; and
for each continuous latent representation of the one or more continuous latent representations for the clean audio, generating a corresponding discrete clean audio representation using the quantizer;
for each noisy audio from the set of one or more clean-noisy audio pairs:
generating one or more continuous latent representations for the noisy audio using the second encoder; and
for each continuous latent representation of the one or more continuous latent representations for the noisy audio, generating a corresponding discrete noisy audio representation using the quantizer;
for each clean-noisy audio pair from the set of one or more clean-noisy audio pairs, inputting the discrete clean audio representation or representations, the clean audio, and a speaker embedding that represents the speaker of the clean-noisy audio pair into the decoder to generate an audio sequence prediction;
computing a loss for the denoising system, in which the loss comprises a latent representation matching loss term that, for a time step in which the discrete clean audio representation and the discrete noisy audio representation for a clean-noisy audio pair have different values, is determined using a distance measure between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio for that time step; and
updating the denoising system using the loss.
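
Illustrative sketch (not part of the claim): the latent representation matching loss term recited above can be pictured with a small pure-Python example. The claim does not fix the quantizer's rule, the distance measure, or how per-step terms are combined, so the nearest-codebook-entry quantizer, the squared Euclidean distance, and the averaging over time steps below are all assumptions made for illustration. The names quantize, squared_distance, and matching_loss are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (an assumed choice of distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Map a continuous latent to the index of its nearest codebook entry.

    Nearest-neighbour lookup is an assumption; the claim only says
    that a quantizer produces a discrete representation.
    """
    return min(range(len(codebook)),
               key=lambda k: squared_distance(latent, codebook[k]))


def matching_loss(clean_latents: Sequence[Vector],
                  noisy_latents: Sequence[Vector],
                  codebook: Sequence[Vector]) -> float:
    """Penalise only the time steps whose discrete codes disagree.

    For each time step, both latents are quantized with the same codebook.
    If the codes differ, the distance between the continuous latents is
    added; if they match, the step contributes nothing. Averaging over
    all time steps is an assumed normalisation.
    """
    if len(clean_latents) == 0:
        return 0.0
    total = 0.0
    for z_clean, z_noisy in zip(clean_latents, noisy_latents):
        if quantize(z_clean, codebook) != quantize(z_noisy, codebook):
            total += squared_distance(z_clean, z_noisy)
    return total / len(clean_latents)


# Toy data: two time steps, a two-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
clean = [[0.1, 0.0], [0.9, 1.0]]   # output of the first (clean) encoder
noisy = [[0.2, 0.1], [0.2, 0.1]]   # output of the second (noisy) encoder
loss = matching_loss(clean, noisy, codebook)
```

In this toy input the first time step quantizes to the same code for both signals and is skipped, while the second step disagrees and contributes its latent distance. In a full training loop this term would be combined with the other loss terms before the update step.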