US 12,296,265 B1
Speech prosody prediction in video games
Kilol Gupta, Redwood City, CA (US); Zahra Shakeri, Newark, CA (US); Gordon Durity, Surrey (CA); Mohsen Sardari, Burlingame, CA (US); Harold Chaput, Castro Valley, CA (US); and Navid Aghdaie, San Jose, CA (US)
Assigned to ELECTRONIC ARTS INC., Redwood City, CA (US)
Filed by Electronic Arts Inc., Redwood City, CA (US)
Filed on Jan. 9, 2024, as Appl. No. 18/407,686.
Application 18/407,686 is a continuation of application No. 16/953,801, filed on Nov. 20, 2020, abandoned.
Int. Cl. A63F 13/54 (2014.01); G06F 3/16 (2006.01); G06N 3/08 (2023.01); G10L 13/02 (2013.01); G10L 19/04 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01)

CPC A63F 13/54 (2014.09) [G06F 3/16 (2013.01); G06N 3/08 (2013.01); G10L 13/02 (2013.01); G10L 19/04 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01); A63F 2300/6081 (2013.01)]

20 Claims

1. A computer-implemented method of generating context-dependent speech audio in a video game, the method comprising:

enabling, by at least one processor of a computing device, gameplay of the video game;

determining, by a video game engine of the video game on the at least one processor, an in-game event for which context-dependent speech audio is to be generated during the gameplay of the video game, wherein the in-game event includes an action performed by a character of the video game;

obtaining, by the video game engine of the video game, contextual information and speech content data relating to a current state of the gameplay;

requesting, by the video game engine of the video game, the context-dependent speech audio from a speech audio generator of the video game;

generating, by the speech audio generator responsive to the request, the context-dependent speech audio by:

    inputting the contextual information relating to the current state of the gameplay into a prosody prediction model, wherein the prosody prediction model comprises a trained machine learning model which is configured to generate predicted prosodic features based on the contextual information;

    generating, by the prosody prediction model, predicted prosodic features from the input contextual information;

    inputting, into a speech audio generation model, input data comprising:

        at least the predicted prosodic features; and

        the speech content data relating to the current state of the gameplay;

    generating, using one or more encoders of the speech audio generation model, an encoded representation of the speech content data dependent on the predicted prosodic features;

    decoding, using a decoder of the speech audio generation model, the encoded representation to generate the context-dependent speech audio; and

causing, by the video game engine of the video game, the context-dependent speech audio that matches the current state of the video game to be played among the gameplay of the in-game event.
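
The data flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: every class name, method name, and numeric constant below is hypothetical, the single "excitement" context value is an assumed stand-in for the contextual information, and the prosody predictor, encoder, and decoder are fixed toy functions standing in for the trained models the claim recites. The sketch only shows the ordering of the steps: the game engine gathers context and speech content for an in-game event, requests audio from the generator, the generator predicts prosodic features from the context, encodes the speech content dependent on those features, decodes the result, and the engine plays it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ProsodicFeatures:
    """Hypothetical container for predicted prosodic features."""
    pitch_hz: float
    energy: float
    rate: float


class ProsodyPredictionModel:
    """Toy stand-in for the trained model: a fixed linear map, not a learned one."""

    def predict(self, context: Dict[str, float]) -> ProsodicFeatures:
        # Assumption: the game context is reduced to a single value in [0, 1].
        excitement = context.get("excitement", 0.0)
        return ProsodicFeatures(
            pitch_hz=120.0 + 60.0 * excitement,
            energy=0.5 + 0.5 * excitement,
            rate=1.0 + 0.3 * excitement,
        )


class SpeechAudioGenerationModel:
    """Toy encoder/decoder pair; a real system would use neural networks."""

    def encode(self, text: str, prosody: ProsodicFeatures) -> List[List[float]]:
        # One frame per character: a content value followed by the prosodic features,
        # so the encoded representation depends on both inputs.
        return [
            [ord(ch) / 255.0, prosody.pitch_hz, prosody.energy, prosody.rate]
            for ch in text
        ]

    def decode(self, encoded: List[List[float]]) -> List[float]:
        # Emits one placeholder "sample" per frame: content value scaled by energy.
        return [frame[0] * frame[2] for frame in encoded]


class SpeechAudioGenerator:
    """Runs prosody prediction, then encoding and decoding, on request."""

    def __init__(self) -> None:
        self.prosody_model = ProsodyPredictionModel()
        self.audio_model = SpeechAudioGenerationModel()

    def generate(self, context: Dict[str, float], speech_content: str) -> List[float]:
        prosody = self.prosody_model.predict(context)
        encoded = self.audio_model.encode(speech_content, prosody)
        return self.audio_model.decode(encoded)


class GameEngine:
    """Hypothetical engine that requests and plays context-dependent audio."""

    def __init__(self, generator: SpeechAudioGenerator) -> None:
        self.generator = generator
        self.played: List[Tuple[str, List[float]]] = []

    def on_event(self, event: str, state: Dict[str, object]) -> List[float]:
        # Obtain contextual information and speech content from the current state.
        context = {"excitement": float(state["excitement"])}
        speech_content = str(state["line"])
        audio = self.generator.generate(context, speech_content)
        self.played.append((event, audio))  # stands in for audio playback
        return audio


engine = GameEngine(SpeechAudioGenerator())
calm = engine.on_event("goal_scored", {"excitement": 0.0, "line": "Goal"})
hyped = engine.on_event("goal_scored", {"excitement": 1.0, "line": "Goal"})
```

In this sketch the same speech content yields different output for different game states, because the predicted prosodic features enter the encoder alongside the content. Keeping prosody prediction separate from audio generation mirrors the two-model split in the claim.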