| CPC G10L 15/16 (2013.01) [G06N 3/0455 (2023.01); G06N 3/0475 (2023.01); G10H 1/0008 (2013.01); G10L 15/063 (2013.01); G10L 15/1815 (2013.01); G10H 2210/056 (2013.01); G10H 2250/311 (2013.01)] | 20 Claims |

|
1. A method for training one or more generative neural networks for generating a prediction of an audio signal, the audio signal having a respective audio sample at each of a plurality of output time steps spanning a time window, the training comprising:
obtaining, as training data, a plurality of target audio signals;
training, on a set of target semantic representations generated from the target audio signals, a third generative neural network that generates a semantic representation of the audio signal, wherein the semantic representation specifies a respective semantic token at each of a plurality of first time steps spanning the time window, each semantic token representing semantic content of the audio signal at the corresponding first time step; and
training, on a set of target acoustic representations generated from the target audio signals, a first generative neural network and a second generative neural network that generate an acoustic representation of the audio signal, wherein the acoustic representation specifies a set of one or more respective acoustic tokens at each of a plurality of second time steps spanning the time window, the one or more respective acoustic tokens at each second time step representing acoustic properties of the audio signal at the corresponding second time step.
|
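The data flow recited in claim 1 can be illustrated with a minimal sketch: each target audio signal yields one semantic token per first time step and a set of acoustic tokens per second time step, and those two target representations form the training sets for the third network and for the first and second networks, respectively. Everything named here is hypothetical and not taken from the claim: the frame-averaging tokenizers `toy_semantic_tokens` and `toy_acoustic_tokens`, the frame lengths, the vocabulary sizes, and the residual-style quantization are stand-ins chosen only so the example runs. The claim does not specify how the tokens are computed or how the first and second networks divide the acoustic modeling.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class TokenizedSignal:
    semantic: List[int]        # one semantic token per first time step
    acoustic: List[List[int]]  # a set of acoustic tokens per second time step


def frame_means(samples: Sequence[float], frame_len: int) -> List[float]:
    """Average the samples in consecutive non-overlapping frames."""
    return [
        sum(samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def quantize(value: float, levels: int) -> int:
    """Map a value in [-1, 1] to an integer token in [0, levels - 1]."""
    clipped = max(-1.0, min(1.0, value))
    return min(levels - 1, int((clipped + 1.0) / 2.0 * levels))


def toy_semantic_tokens(samples: Sequence[float],
                        frame_len: int = 4, vocab: int = 8) -> List[int]:
    # ASSUMPTION: a stand-in tokenizer; the claim does not say how
    # semantic tokens are derived from the audio.
    return [quantize(m, vocab) for m in frame_means(samples, frame_len)]


def toy_acoustic_tokens(samples: Sequence[float], frame_len: int = 2,
                        vocab: int = 16,
                        tokens_per_step: int = 2) -> List[List[int]]:
    # ASSUMPTION: a stand-in tokenizer that quantizes each frame mean and
    # then the rescaled residual, giving several tokens per time step.
    steps = []
    for mean in frame_means(samples, frame_len):
        residual, step_tokens = mean, []
        for _ in range(tokens_per_step):
            token = quantize(residual, vocab)
            step_tokens.append(token)
            center = (token + 0.5) / vocab * 2.0 - 1.0  # bin midpoint
            residual = (residual - center) * vocab      # back to [-1, 1]
        steps.append(step_tokens)
    return steps


def build_training_sets(
    target_signals: Sequence[Sequence[float]],
) -> Tuple[List[List[int]], List[List[List[int]]]]:
    """Return (semantic targets, acoustic targets) for all target signals."""
    tokenized = [
        TokenizedSignal(toy_semantic_tokens(s), toy_acoustic_tokens(s))
        for s in target_signals
    ]
    semantic_targets = [t.semantic for t in tokenized]  # third network
    acoustic_targets = [t.acoustic for t in tokenized]  # first and second networks
    return semantic_targets, acoustic_targets
```

In this sketch the semantic and acoustic sequences have different lengths for the same time window, because the first and second time steps use different (assumed) frame lengths. The claim only requires that each set of time steps spans the window.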