CPC G10L 21/0208 (2013.01) [G06N 3/045 (2023.01); G10L 25/30 (2013.01)] | 20 Claims |
1. A computer-implemented method for training a denoising system comprising:
given a denoising system comprising a first encoder, a second encoder, a quantizer, and a decoder and given a set of one or more clean-noisy audio pairs, in which each clean-noisy audio pair comprises a clean audio of content by a speaker and a noisy audio of the content by the speaker:
for each clean audio from the set of one or more clean-noisy audio pairs:
generating one or more continuous latent representations for the clean audio using the first encoder; and
for each continuous latent representation of the one or more continuous latent representations for the clean audio, generating a corresponding discrete clean audio representation using the quantizer;
for each noisy audio from the set of one or more clean-noisy audio pairs:
generating one or more continuous latent representations for the noisy audio using the second encoder; and
for each continuous latent representation of the one or more continuous latent representations for the noisy audio, generating a corresponding discrete noisy audio representation using the quantizer;
for each clean-noisy audio pair from the set of one or more clean-noisy audio pairs, inputting the discrete clean audio representation or representations, the clean audio, and a speaker embedding that represents the speaker of the clean-noisy audio pair into the decoder to generate an audio sequence prediction;
computing a loss for the denoising system, in which the loss comprises a latent representation matching loss term that, for a time step in which the discrete clean audio representation and the discrete noisy audio representation for a clean-noisy audio pair have different values, is determined using a distance measure between the continuous latent representation of the clean audio and the continuous latent representation of the noisy audio for that time step; and
updating the denoising system using the loss.
|
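The latent representation matching loss term recited in the claim can be illustrated with a minimal sketch. The claim specifies only "a distance measure" and "the quantizer", so the following are assumptions made for illustration and are not taken from the claim: a nearest-neighbor codebook quantizer, squared Euclidean distance, and averaging over time steps. The names `quantize`, `squared_distance`, and `latent_matching_loss` are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the nearest codebook entry (assumed quantizer)."""
    return min(
        range(len(codebook)),
        key=lambda k: squared_distance(latent, codebook[k]),
    )


def latent_matching_loss(
    clean_latents: Sequence[Vector],
    noisy_latents: Sequence[Vector],
    codebook: Sequence[Vector],
) -> float:
    """Penalize only time steps whose discrete codes differ.

    clean_latents: continuous latents from the first encoder, one per time step.
    noisy_latents: continuous latents from the second encoder, one per time step.
    Averaging over time steps is an assumption; the claim does not fix a
    normalization.
    """
    if len(clean_latents) != len(noisy_latents):
        raise ValueError("clean and noisy latents must cover the same time steps")
    if not clean_latents:
        return 0.0

    total = 0.0
    for z_clean, z_noisy in zip(clean_latents, noisy_latents):
        # The term is active only where the two discrete representations disagree.
        if quantize(z_clean, codebook) != quantize(z_noisy, codebook):
            total += squared_distance(z_clean, z_noisy)
    return total / len(clean_latents)


# Toy example: two time steps, a two-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
clean = [[0.1, 0.0], [0.9, 1.0]]
noisy = [[0.2, 0.1], [0.2, 0.1]]
loss = latent_matching_loss(clean, noisy, codebook)
```

In this toy example the first time step quantizes to the same codebook entry for both inputs and contributes nothing, while the second time step quantizes to different entries and contributes the distance between its two continuous latents. In a full training loop this term would be combined with the other loss terms before the update step.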