US 12,322,380 B2
Generating audio using auto-regressive generative neural networks
Andrea Agostinelli, Zurich (CH); Timo Immanuel Denk, Zurich (CH); Antoine Caillon, Paris (FR); Neil Zeghidour, Paris (FR); Jesse Engel, Orinda, CA (US); Mauro Verzetti, Dübendorf (CH); Christian Frank, Zurich (CH); Zalán Borsos, Zurich (CH); Matthew Sharifi, Kilchberg (CH); Adam Joseph Roberts, Durham, NC (US); and Marco Tagliasacchi, Kilchberg (CH)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jan. 12, 2024, as Appl. No. 18/412,394.
Application 18/412,394 is a continuation of application No. 18/463,196, filed on Sep. 7, 2023, granted, now Pat. No. 11,915,689.
Claims priority of provisional application 63/441,412, filed on Jan. 26, 2023.
Claims priority of provisional application 63/404,528, filed on Sep. 7, 2022.
Prior Publication US 2024/0233713 A1, Jul. 11, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G10L 15/16 (2006.01); G06N 3/0455 (2023.01); G06N 3/0475 (2023.01); G10H 1/00 (2006.01); G10L 15/06 (2013.01); G10L 15/18 (2013.01)

CPC G10L 15/16 (2013.01) [G06N 3/0455 (2023.01); G06N 3/0475 (2023.01); G10H 1/0008 (2013.01); G10L 15/063 (2013.01); G10L 15/1815 (2013.01); G10H 2210/056 (2013.01); G10H 2250/311 (2013.01)]

20 Claims

1. A method for training one or more generative neural networks for generating a prediction of an audio signal, the audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window, the training comprising:

obtaining, as training data, a plurality of target audio signals;

training, on a set of target semantic representations generated from the target audio signals, a third generative neural network that generates a semantic representation of the audio signal, wherein the semantic representation specifies a respective semantic token at each of a plurality of first time steps spanning the time window, each semantic token representing semantic content of the audio signal at the corresponding first time step; and

training, on a set of target acoustic representations generated from the target audio signals, a first generative neural network and a second generative neural network that generate an acoustic representation of the audio signal, wherein the acoustic representation specifies a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, the one or more respective acoustic tokens at each second time step representing acoustic properties of the audio signal at the corresponding second time step.
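
By way of illustration only, the data layout recited in claim 1 can be sketched as follows. The sketch shows a target audio signal reduced to one semantic token per first time step and a set of acoustic tokens per second time step, and then shows how next-token training examples might be assembled for the three generative neural networks. Everything named here is hypothetical and is not taken from the claim: the `toy_tokenize` quantizer stands in for whatever encoders produce the target representations, the hop sizes and vocabulary size are arbitrary, and assigning the first acoustic level to the first network and the remaining levels to the second network is an assumed split, since the claim states only that the two networks together generate the acoustic representation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TokenizedSignal:
    semantic: List[int]        # one semantic token per first time step
    acoustic: List[List[int]]  # one or more acoustic tokens per second time step


def frame_means(samples: List[float], hop: int) -> List[float]:
    """Average the samples in consecutive, non-overlapping frames of length `hop`."""
    frames = [samples[i:i + hop] for i in range(0, len(samples), hop)]
    return [sum(f) / len(f) for f in frames]


def toy_tokenize(samples: List[float], semantic_hop: int = 4,
                 acoustic_hop: int = 2, levels: int = 2,
                 vocab: int = 8) -> TokenizedSignal:
    """Hypothetical stand-in tokenizer for samples in [-1, 1]; not a real codec."""
    def bucket(x: float) -> int:
        # Map a value in [-1, 1] to an integer token in 0..vocab-1.
        return max(0, min(vocab - 1, int((x + 1.0) / 2.0 * vocab)))

    semantic = [bucket(m) for m in frame_means(samples, semantic_hop)]
    acoustic = []
    for m in frame_means(samples, acoustic_hop):
        tokens, residual = [], m
        for _ in range(levels):
            t = bucket(residual)
            tokens.append(t)
            # Quantize what the previous level missed, rescaled back to [-1, 1].
            centre = (t + 0.5) / vocab * 2.0 - 1.0
            residual = (residual - centre) * vocab
        acoustic.append(tokens)
    return TokenizedSignal(semantic, acoustic)


def next_token_pairs(tokens: List[int]) -> List[Tuple[List[int], int]]:
    """Auto-regressive examples: (all earlier tokens, the token to predict)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def build_training_sets(signals: List[TokenizedSignal]):
    """Assemble one example set per network (assumed split, see lead-in)."""
    third_set, first_set, second_set = [], [], []
    for sig in signals:
        third_set += next_token_pairs(sig.semantic)  # semantic model
        level_0 = [frame[0] for frame in sig.acoustic]
        rest = [t for frame in sig.acoustic for t in frame[1:]]
        first_set += next_token_pairs(level_0)       # first acoustic model
        second_set += next_token_pairs(rest)         # second acoustic model
    return third_set, first_set, second_set
```

With the default hops, an eight-sample signal yields two semantic tokens and four acoustic frames of two tokens each, so the semantic and acoustic representations span the same time window at different rates. A practical system would also condition the acoustic networks on the semantic tokens; the sketch omits that for brevity.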