US 12,314,850 B2
Audio processing with neural networks
Dominik Roblek, Meilen (CH); and Matthew Sharifi, Kilchberg (CH)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on May 3, 2021, as Appl. No. 17/306,934.
Application 17/306,934 is a continuation of application No. 15/151,374, filed on May 10, 2016, now Pat. No. 11,003,987, issued on May 11, 2021.
Prior Publication US 2021/0256379 A1, Aug. 19, 2021
Int. Cl. G06N 3/08 (2023.01); G06F 3/16 (2006.01); G06N 3/049 (2023.01); G06N 3/084 (2023.01); G10L 25/30 (2013.01)
CPC G06N 3/08 (2013.01) [G06F 3/16 (2013.01); G06N 3/049 (2013.01); G06N 3/084 (2013.01); G10L 25/30 (2013.01)]
18 Claims
 
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving, as input to a trained frequency-transform (F-T) layer of a neural network, time domain features of an input audio sample, the input audio sample comprising a speech signal representing an utterance and a background signal;
applying, by the trained F-T layer, a transformation defined by a set of trained F-T layer parameters that transforms the time domain features of the input audio sample into frequency domain features, the trained F-T layer trained to learn a mapping between time domain features and frequency features for a particular audio processing task;
processing, using a plurality of convolutional layers of the neural network, the frequency domain features that were transformed by the transformation applied by the trained F-T layer;
generating, as output from the neural network, based on the processed frequency domain features, an output audio sample by subtracting the background signal from the speech signal of the input audio sample; and
training the neural network on training data comprising, for each of a plurality of training audio samples, corresponding time domain features of the training audio sample and a corresponding known output for the training audio sample, wherein training the neural network comprises updating tunable parameters of the trained F-T layer.
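
The operations recited in claim 1 can be illustrated with a toy sketch. The code below is not the patented implementation. It assumes a frequency-transform layer that is a plain linear map initialised from a cosine basis, a single 1-D convolution standing in for the plurality of convolutional layers, a squared-error loss, and a made-up target vector as the "known output". All function names (init_ft_weights, ft_layer, conv1d_same, train_step) are hypothetical. Its one purpose is to show the F-T layer parameters being updated by gradient descent together with the convolution kernel, rather than being fixed like a conventional Fourier transform.

```python
import math

def init_ft_weights(n_freq, n_time):
    """Assumed initialisation: cosine (real DFT) basis. Training may move the weights away from it."""
    return [[math.cos(2 * math.pi * k * t / n_time) / n_time for t in range(n_time)]
            for k in range(n_freq)]

def ft_layer(weights, x):
    """Apply the learned linear map from time domain features to frequency domain features."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def conv1d_same(f, kernel):
    """1-D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(f)):
        acc = 0.0
        for m, k in enumerate(kernel):
            j = i + m - half
            if 0 <= j < len(f):
                acc += k * f[j]
        out.append(acc)
    return out

def train_step(weights, kernel, x, target, lr):
    """One gradient-descent step on 0.5 * squared error.

    Updates the F-T weights and the kernel in place and returns the loss
    measured before the update.
    """
    f = ft_layer(weights, x)
    y = conv1d_same(f, kernel)
    dy = [a - b for a, b in zip(y, target)]       # dLoss/dy
    half = len(kernel) // 2
    df = [0.0] * len(f)                           # dLoss/df, back through the convolution
    dk = [0.0] * len(kernel)                      # dLoss/dkernel
    for i in range(len(f)):
        for m, k in enumerate(kernel):
            j = i + m - half
            if 0 <= j < len(f):
                df[j] += dy[i] * k
                dk[m] += dy[i] * f[j]
    for j, row in enumerate(weights):             # tunable F-T layer parameters
        for t in range(len(row)):
            row[t] -= lr * df[j] * x[t]
    for m in range(len(kernel)):
        kernel[m] -= lr * dk[m]
    return 0.5 * sum(d * d for d in dy)

# Toy data (assumed): a sine "speech" component plus a weaker cosine "background".
n_time, n_freq = 16, 8
x = [math.sin(2 * math.pi * 2 * t / n_time) + 0.3 * math.cos(2 * math.pi * 5 * t / n_time)
     for t in range(n_time)]
target = [1.0 if k == 2 else 0.0 for k in range(n_freq)]  # hypothetical known output

weights = init_ft_weights(n_freq, n_time)
initial_weights = [row[:] for row in weights]
kernel = [0.0, 1.0, 0.0]
losses = [train_step(weights, kernel, x, target, lr=0.02) for _ in range(50)]
print(losses[0], losses[-1])
```

In this sketch the cosine initialisation gives almost no response to the sine component, so the loss can only fall if the transform weights themselves change, which is the point of making the F-T layer trainable. A realistic system would use several convolutional layers, many training samples, and an output in the audio domain.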