US 12,190,896 B2
Generating audio waveforms using encoder and decoder neural networks
Yunpeng Li, Zurich (CH); Marco Tagliasacchi, Kilchberg (CH); Dominik Roblek, Meilen (CH); Félix de Chaumont Quitry, Zurich (CH); Beat Gfeller, Dubendorf (CH); Hannah Raphaelle Muckenhirn, Zurich (CH); Victor Ungureanu, Thalwil (CH); Oleg Rybakov, Redmond, WA (US); Karolis Misiunas, Zurich (CH); and Zalán Borsos, Zurich (CH)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jul. 1, 2022, as Appl. No. 17/856,292.
Claims priority of provisional application 63/218,141, filed on Jul. 2, 2021.
Prior Publication US 2023/0013370 A1, Jan. 19, 2023
Int. Cl. G10L 19/022 (2013.01); G06N 3/045 (2023.01)

CPC G10L 19/022 (2013.01) [G06N 3/045 (2023.01)]

47 Claims

1. A method performed by one or more computers, the method comprising:

receiving an input audio waveform that comprises a respective input audio sample for each of a plurality of input time steps;

processing the input audio waveform using an encoder neural network to generate a set of feature vectors representing the input audio waveform,

wherein the encoder neural network comprises a sequence of encoder blocks that are each configured to:

process a respective set of input feature vectors in accordance with a set of encoder block parameters to generate a set of output feature vectors, comprising down-sampling the set of input feature vectors; and

processing the set of feature vectors representing the input audio waveform using a decoder neural network to generate an output audio waveform that comprises a respective output audio sample for each of a plurality of output time steps,

wherein the decoder neural network comprises a sequence of decoder blocks that are each configured to:

process a respective set of input feature vectors in accordance with a set of decoder block parameters to generate a set of output feature vectors, comprising up-sampling the set of input feature vectors;

wherein the output audio waveform represents a version of the input audio waveform that has been filtered to include only audio from a target audio source; and

wherein the encoder neural network, the decoder neural network, or both additionally process a conditioning vector representing the target audio source.
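
Illustrative sketch (not part of the claim): the data flow recited in claim 1 can be pictured with a toy example in plain Python. Everything below is an assumption made for illustration only: the function names (encoder_block, decoder_block, apply_conditioning, filter_waveform), the use of a single scalar as each block's "parameters", down-sampling by averaging adjacent feature vectors, up-sampling by repetition, and conditioning by elementwise multiplication between the encoder and the decoder. The claim does not specify any of these choices, and the sketch performs no learned source separation; it only shows how the number of feature vectors shrinks through the encoder blocks and grows back through the decoder blocks.

```python
from typing import List

Vector = List[float]


def encoder_block(features: List[Vector], weight: float, stride: int = 2) -> List[Vector]:
    """Hypothetical encoder block: average each group of `stride` consecutive
    feature vectors (down-sampling), then scale by a single scalar parameter."""
    out = []
    for start in range(0, len(features) - stride + 1, stride):
        group = features[start:start + stride]
        dim = len(group[0])
        out.append([weight * sum(v[d] for v in group) / stride for d in range(dim)])
    return out


def decoder_block(features: List[Vector], weight: float, factor: int = 2) -> List[Vector]:
    """Hypothetical decoder block: scale each feature vector, then repeat it
    `factor` times (up-sampling)."""
    out = []
    for v in features:
        scaled = [weight * x for x in v]
        out.extend(list(scaled) for _ in range(factor))
    return out


def apply_conditioning(features: List[Vector], conditioning: Vector) -> List[Vector]:
    """Assumed conditioning scheme: gate every feature vector elementwise by a
    conditioning vector that stands in for the target audio source."""
    return [[x * c for x, c in zip(v, conditioning)] for v in features]


def filter_waveform(
    waveform: List[float],
    encoder_weights: List[float],
    decoder_weights: List[float],
    conditioning: Vector,
) -> List[float]:
    """Run the toy pipeline: samples -> encoder blocks -> conditioning ->
    decoder blocks -> one output sample per output time step."""
    # Lift each input sample to a feature vector with the conditioning vector's width.
    features = [[sample] * len(conditioning) for sample in waveform]
    for w in encoder_weights:
        features = encoder_block(features, w)
    features = apply_conditioning(features, conditioning)
    for w in decoder_weights:
        features = decoder_block(features, w)
    # Project each feature vector back to a single audio sample (mean of channels).
    return [sum(v) / len(v) for v in features]


if __name__ == "__main__":
    samples = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
    output = filter_waveform(samples, [1.0, 1.0], [1.0, 1.0], [1.0, 0.0])
    print(output)  # [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]
```

With two encoder blocks and two decoder blocks, eight input samples are reduced to two feature vectors and then expanded back to eight output samples, so the output waveform in this sketch has the same number of time steps as the input.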