US 11,816,577 B2
Augmentation of audiographic images for improved machine learning
Daniel Sung-Joon Park, Sunnyvale, CA (US); Quoc Le, Sunnyvale, CA (US); William Chan, Toronto (CA); Ekin Dogus Cubuk, San Francisco, CA (US); Barret Zoph, San Francisco, CA (US); Yu Zhang, Mountain View, CA (US); and Chung-Cheng Chiu, Mountain View, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Sep. 28, 2021, as Appl. No. 17/487,548.
Application 17/487,548 is a continuation of application No. 16/416,888, filed on May 20, 2019, granted, now 11,138,471.
Claims priority of provisional application 62/831,528, filed on Apr. 9, 2019.
Claims priority of provisional application 62/673,777, filed on May 18, 2018.
Prior Publication US 2022/0012537 A1, Jan. 13, 2022
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/06 (2013.01); G10L 15/12 (2006.01); G06N 3/084 (2023.01); G10L 15/16 (2006.01); G10L 15/28 (2013.01); G06N 20/00 (2019.01); G06F 18/214 (2023.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01)

CPC G06N 3/084 (2013.01) [G06F 18/2148 (2023.01); G06N 20/00 (2019.01); G06V 10/7747 (2022.01); G06V 10/82 (2022.01); G10L 15/063 (2013.01); G10L 15/12 (2013.01); G10L 15/16 (2013.01); G10L 15/28 (2013.01)]

22 Claims

21. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising:

obtaining a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal;

generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, wherein the one or more augmentation operations include:

a time warping operation comprising warping image content of the audiographic image along an axis representative of time,

a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or

a time masking operation comprising changing pixel values for image content associated with a certain subset of time steps represented by the audiographic image;

inputting the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions;

evaluating an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and

modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
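
The operations recited in claim 21 can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the implementation described in the patent. Each audiographic image is assumed to be a 2-D array with frequency along the rows and time along the columns. The function names (time_warp, frequency_mask, time_mask, train_step), their parameters (max_shift, max_width, fill, lr), and the choice of masking with a constant fill value are all hypothetical. A linear model trained on squared error stands in for the machine-learned audio processing model and its objective function.

```python
import numpy as np


def time_warp(spec, rng, max_shift=3):
    """Warp image content along the time axis (columns), assumed layout (freq, time).

    A random anchor column is moved by up to max_shift columns, and the
    columns on either side are stretched or squeezed by linear interpolation.
    """
    n_time = spec.shape[1]
    lo, hi = max_shift + 1, n_time - max_shift - 1
    if max_shift <= 0 or lo >= hi:
        return spec.copy()  # image too short to warp safely
    src = int(rng.integers(lo, hi))                      # anchor before warping
    dst = src + int(rng.integers(-max_shift, max_shift + 1))  # anchor after warping
    t_out = np.arange(n_time)
    # For every output column, the (fractional) source column it is read from.
    t_src = np.interp(t_out, [0, dst, n_time - 1], [0, src, n_time - 1])
    return np.stack([np.interp(t_src, t_out, row) for row in spec])


def frequency_mask(spec, rng, max_width=4, fill=0.0):
    """Overwrite a random band of consecutive frequency rows with `fill`."""
    out = spec.copy()
    n_freq = spec.shape[0]
    width = int(rng.integers(0, min(max_width, n_freq) + 1))
    start = int(rng.integers(0, n_freq - width + 1))
    out[start:start + width, :] = fill
    return out


def time_mask(spec, rng, max_width=4, fill=0.0):
    """Overwrite a random band of consecutive time columns with `fill`."""
    out = spec.copy()
    n_time = spec.shape[1]
    width = int(rng.integers(0, min(max_width, n_time) + 1))
    start = int(rng.integers(0, n_time - width + 1))
    out[:, start:start + width] = fill
    return out


def train_step(weights, images, targets, lr=1e-4):
    """One gradient step for a stand-in linear model with a squared-error objective."""
    x = images.reshape(len(images), -1)      # flatten each image to a feature vector
    preds = x @ weights                      # model predictions
    err = preds - targets
    loss = float(np.mean(err ** 2))          # objective function value
    grad = 2.0 * x.T @ err / err.size        # gradient of the mean squared error
    return weights - lr * grad, loss         # modified parameter values, loss


# Usage: augment a batch of synthetic images, then take one training step.
rng = np.random.default_rng(0)
batch = rng.random((4, 16, 20))              # 4 images, 16 freq bins, 20 time steps
augmented = np.stack(
    [time_mask(frequency_mask(time_warp(img, rng), rng), rng) for img in batch]
)
targets = rng.random((4, 1))
weights = np.zeros((16 * 20, 1))
weights, loss = train_step(weights, augmented, targets)
```

The claim lists the three augmentation operations in the alternative ("or"), so chaining all three in the usage lines is only one possible policy; applying any single operation would equally match the recited step.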