| CPC G06N 3/08 (2013.01) [G06F 3/16 (2013.01); G06N 3/049 (2013.01); G06N 3/084 (2013.01); G10L 25/30 (2013.01)] | 18 Claims |

|
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising:
receiving, as input to a trained frequency-transform (F-T) layer of a neural network, time domain features of an input audio sample, the input audio sample comprising a speech signal representing an utterance and a background signal;
applying, by the trained F-T layer, a transformation defined by a set of trained F-T layer parameters that transforms the time domain features of the input audio sample into frequency domain features, the trained F-T layer trained to learn a mapping between time domain features and frequency features for a particular audio processing task;
processing, using a plurality of convolutional layers of the neural network, the frequency domain features that were transformed by the transformation applied by the trained F-T layer;
generating, as output from the neural network, based on the processed frequency domain features, an output audio sample by subtracting the background signal from the speech signal of the input audio sample; and
training the neural network on training data comprising, for each of a plurality of training audio samples, corresponding time domain features of the training audio sample and a corresponding known output for the training audio sample, wherein training the neural network comprises updating tunable parameters of the trained F-T layer.
|
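The trainable F-T layer recited in claim 1 can be illustrated with a minimal sketch. The claim does not say how the layer is parameterized, so everything here is an assumption: the class name `FTLayer`, the initialization of the weights to a discrete Fourier transform (DFT) matrix, the squared-error loss, and the hand-written gradient step are all hypothetical. The convolutional layers and the background-subtraction output are left out. To keep the example short, the layer is trained directly against a target spectrum rather than through a full network.

```python
import math


class FTLayer:
    """Hypothetical trainable frequency-transform layer (illustrative only)."""

    def __init__(self, n):
        self.n = n
        # Tunable parameters, initialized to the real and imaginary parts
        # of an n-point DFT matrix: exp(-2j*pi*k*t/n). This starting point
        # is an assumption; the claim only requires trained parameters.
        self.w_re = [[math.cos(2 * math.pi * k * t / n) for t in range(n)]
                     for k in range(n)]
        self.w_im = [[-math.sin(2 * math.pi * k * t / n) for t in range(n)]
                     for k in range(n)]

    def forward(self, x):
        """Map n time domain samples to n (real, imaginary) frequency features."""
        re = [sum(w * v for w, v in zip(row, x)) for row in self.w_re]
        im = [sum(w * v for w, v in zip(row, x)) for row in self.w_im]
        return re, im

    def sgd_step(self, x, target_re, target_im, lr=0.01):
        """Apply one gradient step on a squared-error loss.

        Returns the loss measured before the parameters are updated.
        """
        re, im = self.forward(x)
        loss = 0.0
        for k in range(self.n):
            err_re = re[k] - target_re[k]
            err_im = im[k] - target_im[k]
            loss += err_re ** 2 + err_im ** 2
            for t in range(self.n):
                # Gradient of err**2 with respect to w[k][t] is 2 * err * x[t].
                self.w_re[k][t] -= lr * 2 * err_re * x[t]
                self.w_im[k][t] -= lr * 2 * err_im * x[t]
        return loss


# Time domain features: one cycle of a sine wave over 8 samples.
n = 8
frame = [math.sin(2 * math.pi * t / n) for t in range(n)]

layer = FTLayer(n)
re, im = layer.forward(frame)
magnitude = [math.hypot(r, i) for r, i in zip(re, im)]

# Hypothetical "known output": the initial spectrum scaled by one half.
target_re = [0.5 * r for r in re]
target_im = [0.5 * i for i in im]

first_loss = layer.sgd_step(frame, target_re, target_im)
second_loss = layer.sgd_step(frame, target_re, target_im)
```

Before any update the layer reproduces the DFT, so the energy of the sine frame sits in bins 1 and 7 of `magnitude`. Each call to `sgd_step` changes the layer's own weights, which is the sense in which the claim's training step updates tunable parameters of the F-T layer. `second_loss` comes out lower than `first_loss` because the first step has already moved the weights toward the target.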