CPC G06T 13/205 (2013.01) [G06T 13/40 (2013.01); G10L 25/30 (2013.01)] | 15 Claims

1. A computer-implemented method, the method comprising:
    obtaining a trained generative machine-learning model, the trained generative machine-learning model configured to process (i) input data derived from speech audio and (ii) a conditioning input representing a particular facial expression to generate facial animation data corresponding to the speech audio and the particular facial expression;
    obtaining input data derived from speech audio for processing by the trained generative machine-learning model;
    determining a conditioning input representing a particular facial expression from a set of reference speech animation examples, each reference speech animation example comprising data derived from speech audio and corresponding ground-truth facial animation data having the particular facial expression, wherein determining the conditioning input comprises:
        initializing the conditioning input;
        processing, using the trained generative machine-learning model: (i) the conditioning input, and (ii) the data derived from speech audio of one or more reference speech animation examples from the set of reference speech animation examples;
        generating, as output of the trained generative machine-learning model, predicted facial animation data for each reference speech animation example;
        determining a loss for each reference speech animation example, wherein the loss for a reference speech animation example is dependent on the predicted facial animation data and the ground-truth facial animation data of the reference speech animation example; and
        updating the conditioning input based on the losses of the reference speech animation examples whilst the weights of the trained generative machine-learning model are held frozen; and
    processing, by the trained generative machine-learning model, (i) the input data derived from speech audio and (ii) the determined conditioning input representing the particular facial expression to generate facial animation data corresponding to the speech audio and the particular facial expression.
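The wherein-clause of the claim describes optimizing the conditioning input itself by gradient descent on per-example losses while the model's weights stay frozen. A minimal sketch of that loop, assuming a toy linear map in place of the trained generative model; all dimensions, variable names, and the squared-error loss are illustrative choices, not taken from the claim:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions (arbitrary, for illustration only).
AUDIO_DIM, COND_DIM, ANIM_DIM = 8, 4, 32

# Frozen stand-in for the trained generative model: a fixed linear map from
# (audio-derived features, conditioning input) to facial animation data.
# W is never updated below -- the "weights held frozen" part of the claim.
W = rng.normal(size=(ANIM_DIM, AUDIO_DIM + COND_DIM))

def model(audio_feats, cond):
    """Generate facial animation data from audio features and a conditioning input."""
    return W @ np.concatenate([audio_feats, cond])

# Reference speech animation examples: (audio-derived data, ground-truth
# facial animation data exhibiting the particular facial expression).
target_cond = rng.normal(size=COND_DIM)  # expression the optimization should recover
examples = [(a, model(a, target_cond))
            for a in rng.normal(size=(16, AUDIO_DIM))]

# Determine the conditioning input: initialize it, then repeatedly update it
# from the per-example losses by gradient descent, leaving W untouched.
cond = np.zeros(COND_DIM)   # initializing the conditioning input
lr = 0.005
W_cond = W[:, AUDIO_DIM:]   # columns of W acting on the conditioning input
for _ in range(300):
    grad = np.zeros(COND_DIM)
    for audio, anim_true in examples:
        err = model(audio, cond) - anim_true   # squared-error loss per example
        grad += 2.0 * W_cond.T @ err           # d(loss)/d(cond); W stays frozen
    cond -= lr * grad / len(examples)

final_loss = np.mean([np.sum((model(a, cond) - y) ** 2) for a, y in examples])
print(f"mean per-example loss after optimization: {final_loss:.2e}")
```

The structure mirrors embedding- or prompt-tuning: only the conditioning vector receives gradient updates, so the recovered `cond` can then be paired with new speech audio in the final processing step of the claim.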