CPC G06T 13/205 (2013.01) [G06T 13/40 (2013.01); G10L 25/30 (2013.01)] | 15 Claims

1. A computer-implemented method, the method comprising:
    obtaining a trained generative machine-learning model, the trained generative machine-learning model configured to process (i) input data derived from speech audio and (ii) a conditioning input representing a particular facial expression to generate facial animation data corresponding to the speech audio and the particular facial expression;
    obtaining input data derived from speech audio for processing by the trained generative machine-learning model;
    determining a conditioning input representing a particular facial expression from a set of reference speech animation examples, each reference speech animation example comprising data derived from speech audio and corresponding ground-truth facial animation data having the particular facial expression, wherein determining the conditioning input comprises:
        initializing the conditioning input;
        processing, using the trained generative machine-learning model: (i) the conditioning input, and (ii) the data derived from speech audio of one or more reference speech animation examples from the set of reference speech animation examples;
        generating, as output of the trained generative machine-learning model, predicted facial animation data for each reference speech animation example;
        determining a loss for each reference speech animation example, wherein the loss for a reference speech animation example is dependent on the predicted facial animation data and the ground-truth facial animation data of the reference speech animation example; and
        updating the conditioning input based on the losses of the reference speech animation examples whilst the weights of the trained generative machine-learning model are held frozen; and
    processing, by the trained generative machine-learning model, (i) the input data derived from speech audio and (ii) the determined conditioning input representing the particular facial expression to generate facial animation data corresponding to the speech audio and the particular facial expression.
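The wherein-clause of the claim describes optimizing the conditioning input itself by gradient descent on per-example losses while the model's weights stay frozen. A minimal sketch of that loop, assuming a toy linear map in place of the trained generative model; all dimensions, variable names, and the squared-error loss are illustrative choices, not taken from the claim:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions (arbitrary, for illustration only).
AUDIO_DIM, COND_DIM, ANIM_DIM = 8, 4, 32

# Frozen stand-in for the trained generative model: a fixed linear map from
# (audio-derived features, conditioning input) to facial animation data.
# W is never updated below -- the "weights held frozen" part of the claim.
W = rng.normal(size=(ANIM_DIM, AUDIO_DIM + COND_DIM))

def model(audio_feats, cond):
    """Generate facial animation data from audio features and a conditioning input."""
    return W @ np.concatenate([audio_feats, cond])

# Reference speech animation examples: (audio-derived data, ground-truth
# facial animation data exhibiting the particular facial expression).
target_cond = rng.normal(size=COND_DIM)  # expression the optimization should recover
examples = [(a, model(a, target_cond))
            for a in rng.normal(size=(16, AUDIO_DIM))]

# Determine the conditioning input: initialize it, then repeatedly update it
# from the per-example losses by gradient descent, leaving W untouched.
cond = np.zeros(COND_DIM)   # initializing the conditioning input
lr = 0.005
W_cond = W[:, AUDIO_DIM:]   # columns of W acting on the conditioning input
for _ in range(300):
    grad = np.zeros(COND_DIM)
    for audio, anim_true in examples:
        err = model(audio, cond) - anim_true   # squared-error loss per example
        grad += 2.0 * W_cond.T @ err           # d(loss)/d(cond); W stays frozen
    cond -= lr * grad / len(examples)

final_loss = np.mean([np.sum((model(a, cond) - y) ** 2) for a, y in examples])
print(f"mean per-example loss after optimization: {final_loss:.2e}")
```

The structure mirrors embedding- or prompt-tuning: only the conditioning vector receives gradient updates, so the recovered `cond` can then be paired with new speech audio in the final processing step of the claim.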