US 12,482,452 B2
Learned audio frontend machine learning model for audio understanding
Neil Zeghidour, Paris (FR); Olivier Teboul, Paris (FR); Félix de Chaumont Quitry, Zürich (CH); and Marco Tagliasacchi, Kilchberg (CH)
Assigned to Google LLC, Mountain View, CA (US)
Appl. No. 18/029,843
Filed by Google LLC, Mountain View, CA (US)
PCT Filed Oct. 4, 2021, PCT No. PCT/US2021/053425 § 371(c)(1), (2) Date Mar. 31, 2023, PCT Pub. No. WO2022/072941, PCT Pub. Date Apr. 7, 2022.
Claims priority of provisional application 63/087,144, filed on Oct. 2, 2020.
Prior Publication US 2023/0377561 A1, Nov. 23, 2023
Int. Cl. G10L 15/02 (2006.01); G10L 15/16 (2006.01)

CPC G10L 15/02 (2013.01) [G10L 15/16 (2013.01)]

20 Claims

1. A method performed by one or more computers, the method comprising:

obtaining an audio waveform comprising a sequence of audio samples at a first frequency;

processing the audio waveform using a learned audio frontend model to generate a feature representation of the audio waveform, wherein the feature representation comprises a sequence of features at a second frequency, wherein the second frequency is lower than the first frequency, and wherein the learned audio frontend model is configured to:

apply a machine-learned filtering operation having a plurality of filtering parameters to the audio waveform to generate a filtered representation comprising a sequence of filtered features at the first frequency, wherein each filtered feature has a respective value for each of a plurality of channels;

apply a machine-learned pooling operation having a plurality of pooling parameters to the filtered representation to generate a pooled representation comprising a sequence of pooled features at the second frequency, wherein applying the machine-learned pooling operation comprises:

for each channel, applying, with stride greater than one, a respective learned lowpass filter for the channel to the values in the filtered features for the channel to generate a set of pooled values for the channel that have the second frequency; and

apply a machine-learned normalization operation having a plurality of normalization parameters to the pooled representation to generate the feature representation; and

processing the feature representation using a first audio understanding machine learning model having a plurality of audio understanding parameters to generate a respective output for each of one or more audio understanding tasks, and wherein the learned audio frontend model and a second audio understanding machine learning model have been trained end-to-end on a set of at least one training audio understanding task to determine the filtering parameters of the machine-learned filtering operation, the pooling parameters of the machine-learned pooling operation comprising parameters of the respective learned lowpass filters for the plurality of channels, and the normalization parameters of the machine-learned normalization operation.
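
The three frontend stages recited in claim 1 (filtering at the first frequency, strided per-channel lowpass pooling down to the second frequency, then normalization) can be illustrated with the minimal sketch below. It is not the patented implementation. The claim does not fix the form of any stage, so the following are assumptions made only for illustration: the bandpass kernels are cosine carriers under a Gaussian window, the filter outputs are squared before pooling, the lowpass filters are Gaussian with one width per channel, and the normalization follows the general shape of per-channel energy normalization (PCEN). All function names, parameter names, and numeric values are hypothetical, and the parameters are fixed placeholders standing in for values that end-to-end training would determine.

```python
import math

def conv1d(signal, kernel, stride=1):
    """Zero-padded 1-D correlation, evaluated at every `stride`-th sample."""
    half = len(kernel) // 2
    out = []
    for t in range(0, len(signal), stride):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def gaussian_kernel(sigma, width):
    """Unit-sum Gaussian lowpass kernel (assumed form of the lowpass filter)."""
    center = (width - 1) / 2
    w = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(width)]
    total = sum(w)
    return [x / total for x in w]

def bandpass_kernel(freq, sigma, width):
    """Cosine carrier under a Gaussian window; freq is in cycles per sample."""
    center = (width - 1) / 2
    window = gaussian_kernel(sigma, width)
    return [g * math.cos(2 * math.pi * freq * (i - center))
            for i, g in enumerate(window)]

def normalize(pooled, alpha, delta, root, smooth, eps=1e-6):
    """PCEN-style normalization with separate parameters per channel (assumption)."""
    out = []
    for c, chan in enumerate(pooled):
        m = chan[0]  # running smoothed energy for this channel
        vals = []
        for x in chan:
            m = (1 - smooth[c]) * m + smooth[c] * x
            gain = x / (eps + m) ** alpha[c]
            vals.append((gain + delta[c]) ** (1 / root[c])
                        - delta[c] ** (1 / root[c]))
        out.append(vals)
    return out

def frontend(waveform, params, stride):
    # Filtering: one kernel per channel, stride 1, so the rate is unchanged.
    # Squaring is an assumed nonlinearity that keeps the values non-negative.
    filtered = [[v * v for v in conv1d(waveform, k)] for k in params["filters"]]
    # Pooling: each channel has its own lowpass filter, applied with stride > 1.
    pooled = [conv1d(chan, gaussian_kernel(s, params["pool_width"]), stride)
              for chan, s in zip(filtered, params["pool_sigma"])]
    # Normalization: produces the feature representation at the lower rate.
    return normalize(pooled, params["alpha"], params["delta"],
                     params["root"], params["smooth"])

# Placeholder values; in the claimed method these would be learned jointly
# with the audio understanding model.
params = {
    "filters": [bandpass_kernel(0.05, 8.0, 41), bandpass_kernel(0.20, 8.0, 41)],
    "pool_sigma": [6.0, 4.0],
    "pool_width": 33,
    "alpha": [0.9, 0.9],
    "delta": [2.0, 2.0],
    "root": [2.0, 2.0],
    "smooth": [0.1, 0.1],
}
waveform = [math.sin(2 * math.pi * 0.05 * n) for n in range(160)]
features = frontend(waveform, params, stride=16)
print(len(features), len(features[0]))  # channels, frames
```

In this sketch the features are stored channel-major (one list of frames per channel). With a stride of 16, the 160 input samples yield 10 frames per channel, which corresponds to the claim's requirement that the second frequency be lower than the first.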