US 12,254,864 B1
Augmenting datasets for training audio generation models
Mateusz Aleksander Lajszczak, Cambridge (GB); Adam Marek Gabrys, Sopot (PL); Arent van Korlaar, London (GB); Ruizhe Li, London (GB); Elena Sergeevna Sokolova, London (GB); Jaime Lorenzo Trueba, Madrid (ES); Arnaud Vincent Pierre Yves Joly, Cambridge (GB); Marco Nicolis, London (GB); and Ekaterina Petrova, Oberhaching (DE)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Jun. 30, 2022, as Appl. No. 17/854,439.
Int. Cl. G10L 13/047 (2013.01); G10L 15/02 (2006.01); G10L 15/16 (2006.01); G10L 15/18 (2013.01); G10L 15/22 (2006.01); G10L 25/18 (2013.01)
CPC G10L 13/047 (2013.01) [G10L 15/02 (2013.01); G10L 15/16 (2013.01); G10L 15/1815 (2013.01); G10L 15/22 (2013.01); G10L 25/18 (2013.01); G10L 2015/025 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A method comprising:
receiving first audio data representing first human speech;
processing the first audio data using a neural network encoder to generate first encoded data, the first encoded data representing semantic information and feature data of the first human speech, wherein the feature data represents one or more of prosody, timbre, speaker identity, or speaking style of the first human speech;
training an acoustic/semantic language model (ASLM) to process a first portion of the first encoded data to predict a second portion of the first encoded data, the second portion occurring after the first portion;
receiving second audio data representing second human speech having voice characteristics to be reproduced by a text-to-speech (TTS) model;
processing the second audio data using the neural network encoder to generate second encoded data, the second encoded data representing semantic information and feature data of the second human speech, wherein the feature data represents one or more of prosody, timbre, speaker identity, or speaking style of the second human speech;
processing the second encoded data using the ASLM to generate third encoded data, the third encoded data representing a predicted continuation of the second human speech;
processing the third encoded data using a neural network decoder to generate third audio data; and
training the TTS model using the second audio data and the third audio data.
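
The gazette entry discloses no implementation of claim 1. As an illustration of the training step recited in the third and fourth claim elements (encoder output split so that a first portion predicts a second, later portion), the following is a minimal sketch in PyTorch. It assumes the neural network encoder emits a sequence of discrete token ids, as in neural audio codecs; the class name ASLM, the hyperparameters, and the helper aslm_train_step are hypothetical, not components disclosed by the patent.

```python
import torch
import torch.nn as nn

class ASLM(nn.Module):
    # Causal transformer over encoded speech frames (hypothetical sizes).
    def __init__(self, vocab_size=1024, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each frame attends only to earlier frames, so the
        # model learns to predict the second portion from the first.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        return self.head(self.blocks(self.embed(tokens), mask=mask))

def aslm_train_step(aslm, optimizer, encoded):
    # encoded: (batch, seq) integer ids produced by the neural network
    # encoder from the first audio data. Shifting by one frame makes the
    # earlier portion the input and the later portion the target.
    inputs, targets = encoded[:, :-1], encoded[:, 1:]
    logits = aslm(inputs)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```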
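In the same hedged spirit, the sketch below walks through the remaining claim elements: encode second audio data in the target voice, autoregressively continue it with the trained ASLM to obtain third encoded data, decode that continuation into third audio data, and pair it with the original recording for TTS training. The names encoder, decoder, augment_voice, and extra_frames are assumptions rather than disclosed components.

```python
import torch

@torch.no_grad()
def augment_voice(encoder, aslm, decoder, recordings, extra_frames=200):
    # recordings: second audio data in the target voice. encoder/decoder
    # stand in for the (unspecified) neural network encoder and decoder.
    pairs = []
    for audio in recordings:
        tokens = encoder(audio)            # second encoded data, (1, seq)
        prompt_len = tokens.size(1)
        for _ in range(extra_frames):      # autoregressive continuation
            logits = aslm(tokens)[:, -1, :]
            nxt = torch.distributions.Categorical(logits=logits).sample()
            tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)
        continuation = tokens[:, prompt_len:]         # third encoded data
        pairs.append((audio, decoder(continuation)))  # third audio data
    return pairs  # original plus synthetic audio for TTS training
```

Sampling from the ASLM's output distribution, rather than taking the argmax, is one plausible way to yield varied continuations, which would give the TTS model a broader range of prosody in the augmented set.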