US 12,233,338 B1
Robust speech audio generation for video games
Ping Zhong, Mountain View, CA (US); Zahra Shakeri, Newark, CA (US); Siddharth Gururani, Santa Clara, CA (US); Kilol Gupta, Redwood City, CA (US); and Shahab Raji, Highland Park, NJ (US)
Assigned to Electronic Arts Inc., Redwood City, CA (US)
Filed by Electronic Arts Inc., Redwood City, CA (US)
Filed on Nov. 16, 2021, as Appl. No. 17/527,533.
Int. Cl. A63F 13/54 (2014.01); G10L 13/02 (2013.01); G10L 17/04 (2013.01); G10L 17/18 (2013.01); G10L 17/22 (2013.01); G10L 19/16 (2013.01)
CPC A63F 13/54 (2014.09) [G10L 13/02 (2013.01); G10L 17/04 (2013.01); G10L 17/18 (2013.01); G10L 17/22 (2013.01); G10L 19/16 (2013.01); A63F 2300/6072 (2013.01)] 20 Claims
 
1. A computer-implemented method of training a machine-learned speech audio generation system for use in a video game, the training comprising:
receiving one or more training examples, each training example comprising: (i) ground-truth acoustic features for speech audio, (ii) speech content data representing speech content of the speech audio, and (iii) a ground-truth speaker identifier for a speaker of the speech audio;
for each of the one or more training examples:
generating, by a speaker encoder, a speaker embedding for the training example, comprising processing the ground-truth acoustic features;
generating, by an expression encoder, an expression embedding for the training example, comprising processing the ground-truth acoustic features;
classifying, by an expression-speaker classifier, the expression embedding to generate a first speaker classification;
generating, by a speech content encoder of a synthesizer, a speech content embedding for the training example, comprising processing the speech content data;
generating a combined embedding, comprising combining the speaker embedding, the expression embedding, and the speech content embedding;
classifying, by a combined-speaker classifier, the combined embedding to generate a second speaker classification;
decoding, by a decoder of the synthesizer, the combined embedding, to generate predicted acoustic features for the training example; and
updating parameters of the machine-learned speech audio generation system to: (i) minimize a measure of difference between the predicted acoustic features of a training example and the corresponding ground-truth acoustic features of the training example, (ii) maximize a measure of difference between the first speaker classification for the training example and the corresponding ground-truth speaker identifier of the training example, and (iii) minimize a measure of difference between the second speaker classification for the training example and the corresponding ground-truth speaker identifier of the training example.
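The system recited in claim 1 amounts to a three-encoder speech synthesis training architecture: two reference encoders (speaker and expression) read the ground-truth acoustic features, a content encoder reads the speech content data, the three embeddings are combined, and a decoder reconstructs the acoustic features while two speaker classifiers shape the embeddings. The following is a minimal illustrative sketch, assuming PyTorch; the module choices (GRU encoders, linear classifiers), all dimensions, the per-frame concatenation, the assumption that content frames are pre-aligned with acoustic frames, and the gradient-reversal realization of the "maximize" term are assumptions introduced here for concreteness, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out


class PooledEncoder(nn.Module):
    """Mean-pools a GRU over time to yield one fixed-size embedding per clip."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.rnn = nn.GRU(in_dim, emb_dim, batch_first=True)

    def forward(self, x):            # x: (batch, frames, in_dim)
        out, _ = self.rnn(x)
        return out.mean(dim=1)       # (batch, emb_dim)


class SpeechAudioGenSystem(nn.Module):
    def __init__(self, feat_dim=80, content_dim=64, emb_dim=128, n_speakers=32):
        super().__init__()
        self.speaker_enc = PooledEncoder(feat_dim, emb_dim)       # speaker encoder
        self.expression_enc = PooledEncoder(feat_dim, emb_dim)    # expression encoder
        self.content_enc = nn.GRU(content_dim, emb_dim,
                                  batch_first=True)               # synthesizer: content encoder
        self.expr_speaker_cls = nn.Linear(emb_dim, n_speakers)    # expression-speaker classifier
        self.comb_speaker_cls = nn.Linear(3 * emb_dim, n_speakers)  # combined-speaker classifier
        self.decoder = nn.GRU(3 * emb_dim, feat_dim,
                              batch_first=True)                   # synthesizer: decoder

    def forward(self, feats, content):
        spk = self.speaker_enc(feats)        # speaker embedding
        expr = self.expression_enc(feats)    # expression embedding
        # First speaker classification: the gradient-reversal layer lets a single
        # minimized cross-entropy train the classifier while pushing the expression
        # encoder to *maximize* that same loss (term (ii) of the claimed objective).
        expr_logits = self.expr_speaker_cls(GradReverse.apply(expr))
        cont, _ = self.content_enc(content)  # speech content embedding, (B, T, emb_dim)
        frames = cont.size(1)
        combined = torch.cat(                # combined embedding: per-frame concatenation
            [cont,
             spk.unsqueeze(1).expand(-1, frames, -1),
             expr.unsqueeze(1).expand(-1, frames, -1)], dim=-1)
        comb_logits = self.comb_speaker_cls(combined.mean(dim=1))  # second classification
        pred_feats, _ = self.decoder(combined)  # predicted acoustic features
        return pred_feats, expr_logits, comb_logits
```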
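Continuing the sketch, one training step implements the three-term update of the final clause: term (i) as an L1 reconstruction loss on the predicted acoustic features, term (iii) as an ordinary cross-entropy on the combined-speaker classifier, and term (ii) as a cross-entropy whose gradient the GradReverse layer above negates, so the expression encoder is driven to maximize the expression-speaker classifier's error while the classifier itself still learns. The equal loss weights and optimizer settings are illustrative.

```python
model = SpeechAudioGenSystem()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(feats, content, speaker_id):
    pred_feats, expr_logits, comb_logits = model(feats, content)
    recon_loss = F.l1_loss(pred_feats, feats)            # (i)   minimize feature difference
    adv_loss = F.cross_entropy(expr_logits, speaker_id)  # (ii)  maximized w.r.t. encoders via GradReverse
    cls_loss = F.cross_entropy(comb_logits, speaker_id)  # (iii) minimize classification error
    loss = recon_loss + adv_loss + cls_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One toy batch: 4 clips of 120 frames, 80-dim acoustic features,
# 64-dim per-frame content features, ground-truth speaker IDs in [0, 32).
feats = torch.randn(4, 120, 80)
content = torch.randn(4, 120, 64)
speaker_id = torch.randint(0, 32, (4,))
print(training_step(feats, content, speaker_id))
```

Gradient reversal (the domain-adversarial training trick of Ganin and Lempitsky) is one common way to train a "maximize" term like (ii) with a single optimizer; an alternating two-optimizer, GAN-style update would satisfy the claim language equally well.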