CPC G10L 19/022 (2013.01) [G06N 3/045 (2023.01)] | 47 Claims |
1. A method performed by one or more computers, the method comprising:
receiving an input audio waveform that comprises a respective input audio sample for each of a plurality of input time steps;
processing the input audio waveform using an encoder neural network to generate a set of feature vectors representing the input audio waveform,
wherein the encoder neural network comprises a sequence of encoder blocks that are each configured to:
process a respective set of input feature vectors in accordance with a set of encoder block parameters to generate a set of output feature vectors, comprising down-sampling the set of input feature vectors; and
processing the set of feature vectors representing the input audio waveform using a decoder neural network to generate an output audio waveform that comprises a respective output audio sample for each of a plurality of output time steps,
wherein the decoder neural network comprises a sequence of decoder blocks that are each configured to:
process a respective set of input feature vectors in accordance with a set of decoder block parameters to generate a set of output feature vectors, comprising up-sampling the set of input feature vectors;
wherein the output audio waveform represents a version of the input audio waveform that has been filtered to include only audio from a target audio source; and
wherein the encoder neural network, the decoder neural network, or both additionally process a conditioning vector representing the target audio source.
|
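The data flow recited in claim 1 can be illustrated with a minimal sketch in plain Python. Everything below is an assumption made for illustration and is not taken from the claim: the helper names (`linear`, `condition`, `down_sample`, `up_sample`, `encoder_block`, `decoder_block`, `separate`), the use of pairwise averaging for down-sampling, repetition for up-sampling, additive conditioning, and random untrained weights. The claim itself does not specify any of these choices, so the sketch only shows how the shapes move through a sequence of down-sampling encoder blocks and up-sampling decoder blocks that both see a conditioning vector.

```python
import random

def linear(vectors, weights):
    """Apply one weight matrix (out_dim x in_dim) to every feature vector."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weights] for v in vectors]

def condition(vectors, cond):
    """Hypothetical conditioning: add the conditioning vector to each feature vector."""
    return [[x + c for x, c in zip(v, cond)] for v in vectors]

def down_sample(vectors, stride=2):
    """Average each group of `stride` consecutive vectors, shortening the time axis."""
    return [
        [sum(col) / stride for col in zip(*vectors[i:i + stride])]
        for i in range(0, len(vectors) - stride + 1, stride)
    ]

def up_sample(vectors, factor=2):
    """Repeat each vector `factor` times, lengthening the time axis."""
    return [list(v) for v in vectors for _ in range(factor)]

def encoder_block(vectors, weights, cond):
    """Transform with block parameters, condition, then down-sample."""
    return down_sample(condition(linear(vectors, weights), cond))

def decoder_block(vectors, weights, cond):
    """Up-sample, transform with block parameters, then condition."""
    return condition(linear(up_sample(vectors), weights), cond)

def separate(waveform, cond, enc_weights, dec_weights, out_weights):
    """Run the waveform through the encoder blocks, then the decoder blocks."""
    features = [[sample] for sample in waveform]  # one 1-dim vector per time step
    for w in enc_weights:
        features = encoder_block(features, w, cond)
    for w in dec_weights:
        features = decoder_block(features, w, cond)
    # Project each feature vector back to a single output audio sample.
    return [v[0] for v in linear(features, out_weights)]

def random_matrix(rng, out_dim, in_dim):
    """Untrained placeholder parameters; a real system would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]

rng = random.Random(0)
dim, depth = 4, 3
enc_weights = [random_matrix(rng, dim, 1)] + [
    random_matrix(rng, dim, dim) for _ in range(depth - 1)
]
dec_weights = [random_matrix(rng, dim, dim) for _ in range(depth)]
out_weights = random_matrix(rng, 1, dim)
cond = [rng.uniform(-1, 1) for _ in range(dim)]       # stands in for the target source
waveform = [rng.uniform(-1, 1) for _ in range(16)]    # 16 input time steps

output = separate(waveform, cond, enc_weights, dec_weights, out_weights)
print(len(waveform), len(output))
```

With three encoder blocks and three decoder blocks, the time axis in this sketch shrinks from 16 to 2 feature vectors and then grows back to 16 output samples. Because the weights are random, the output is not a separated source; only the structure (down-sampling, up-sampling, and conditioning in both networks) mirrors the claim language.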