| CPC G10L 15/02 (2013.01) [G10L 15/16 (2013.01)] | 20 Claims |

|
1. A method performed by one or more computers, the method comprising:
obtaining an audio waveform comprising a sequence of audio samples at a first frequency;
processing the audio waveform using a learned audio frontend model to generate a feature representation of the audio waveform, wherein the feature representation comprises a sequence of features at a second frequency, wherein the second frequency is lower than the first frequency, and wherein the learned audio frontend model is configured to:
apply a machine-learned filtering operation having a plurality of filtering parameters to the audio waveform to generate a filtered representation comprising a sequence of filtered features at the first frequency, wherein each filtered feature has a respective value for each of a plurality of channels;
apply a machine-learned pooling operation having a plurality of pooling parameters to the filtered representation to generate a pooled representation comprising a sequence of pooled features at the second frequency, wherein applying the machine-learned pooling operation comprises:
for each channel, applying, with stride greater than one, a respective learned lowpass filter for the channel to the values in the filtered features for the channel to generate a set of pooled values for the channel that have the second frequency; and
apply a machine-learned normalization operation having a plurality of normalization parameters to the pooled representation to generate the feature representation; and
processing the feature representation using a first audio understanding machine learning model having a plurality of audio understanding parameters to generate a respective output for each of one or more audio understanding tasks, and wherein the learned audio frontend model and a second audio understanding machine learning model have been trained end-to-end on a set of at least one training audio understanding task to determine the filtering parameters of the machine-learned filtering operation, the pooling parameters of the machine-learned pooling operation comprising parameters of the respective learned lowpass filters for the plurality of channels, and the normalization parameters of the machine-learned normalization operation.
|
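
The three frontend stages recited in claim 1 (filtering at the input rate, per-channel strided lowpass pooling, and normalization) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the claimed implementation: the function names, the squared-magnitude nonlinearity, the Gaussian form of the lowpass filter, the log-affine normalization, and all sizes are hypothetical choices made for the example. The parameters are set by hand or at random here, whereas the claim has them determined by end-to-end training with an audio understanding model, which the sketch does not show.

```python
import numpy as np

def learned_filtering(waveform, kernels):
    """Filter the waveform with one kernel per channel, keeping the input rate.

    waveform: (num_samples,); kernels: (num_channels, kernel_size).
    Returns (num_samples, num_channels).
    """
    out = [np.convolve(waveform, k, mode="same") for k in kernels]
    # Squaring is an assumed nonlinearity; it keeps values non-negative.
    return np.stack(out, axis=-1) ** 2

def gaussian_lowpass(sigma, size):
    """Unit-sum Gaussian window; sigma stands in for a learned pooling parameter."""
    t = np.arange(size) - (size - 1) / 2.0
    w = np.exp(-0.5 * (t / sigma) ** 2)
    return w / w.sum()

def learned_pooling(filtered, sigmas, size, stride):
    """Apply each channel's own lowpass filter, then keep every stride-th value."""
    assert stride > 1
    pooled = []
    for c in range(filtered.shape[1]):
        lowpass = gaussian_lowpass(sigmas[c], size)
        smoothed = np.convolve(filtered[:, c], lowpass, mode="same")
        # Smoothing then subsampling equals a strided convolution.
        pooled.append(smoothed[::stride])
    return np.stack(pooled, axis=-1)

def learned_normalization(pooled, gain, bias, eps=1e-6):
    """Assumed normalization: log compression with per-channel gain and bias."""
    return gain * np.log(pooled + eps) + bias

def frontend(waveform, params, stride):
    filtered = learned_filtering(waveform, params["kernels"])
    pooled = learned_pooling(filtered, params["sigmas"], params["pool_size"], stride)
    return learned_normalization(pooled, params["gain"], params["bias"])

# Hypothetical setup: one second of audio at 16 kHz, 8 channels, stride 160.
rng = np.random.default_rng(0)
num_channels = 8
params = {
    "kernels": 0.01 * rng.standard_normal((num_channels, 401)),
    "sigmas": np.linspace(20.0, 80.0, num_channels),
    "pool_size": 401,
    "gain": np.ones(num_channels),
    "bias": np.zeros(num_channels),
}
waveform = rng.standard_normal(16000)
features = frontend(waveform, params, stride=160)
```

With these assumed sizes, the 16,000 input samples become 100 feature vectors of 8 channels each, so the second frequency is the first frequency divided by the stride. Each channel uses its own `sigmas[c]`, mirroring the "respective learned lowpass filter for the channel" language of the claim.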