US 12,307,567 B2
Methods and systems for emotion-controllable generalized talking face generation
Sanjana Sinha, Kolkata (IN); Sandika Biswas, Kolkata (IN); and Brojeshwar Bhowmick, Kolkata (IN)
Assigned to Tata Consultancy Services Limited, Mumbai (IN)
Filed by Tata Consultancy Services Limited, Mumbai (IN)
Filed on Feb. 2, 2023, as Appl. No. 18/163,704.
Claims priority of application No. 202221025055 (IN), filed on Apr. 28, 2022.
Prior Publication US 2023/0351662 A1, Nov. 2, 2023
Int. Cl. G06T 13/40 (2011.01); G06N 3/0455 (2023.01); G06T 13/20 (2011.01); G06V 10/80 (2022.01); G06V 40/16 (2022.01); G10L 25/63 (2013.01)
CPC G06T 13/40 (2013.01) [G06N 3/0455 (2023.01); G06T 13/205 (2013.01); G06V 10/806 (2022.01); G06V 40/171 (2022.01); G10L 25/63 (2013.01)] 13 Claims
OG exemplary drawing
 
7. A system comprising:
a memory storing instructions;
one or more input/output (I/O) interfaces; and
one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to:
receive a plurality of training samples, wherein each training sample of the plurality of training samples comprises a speech audio input data, an emotion input data comprising an emotion type and an emotion intensity, an input image of a target subject in a neutral emotion, and a ground-truth image corresponding to the emotion input data;
train a geometry-aware landmark generation network, with each training sample at a time, until the plurality of training samples is completed, to obtain a trained speech and emotion driven geometry-aware landmark generation model, wherein the geometry-aware landmark generation network comprises an audio encoder network, a first emotion encoder network, a graph encoder network, and a graph decoder network, and wherein training the geometry-aware landmark generation network with each training sample comprises:
obtaining a set of emotion-invariant speech embedding features, from the speech audio input data present in the training sample, using the audio encoder network;
obtaining a set of first emotion embedding features, from the emotion input data present in the training sample, using the first emotion encoder network;
obtaining a set of graph embedding features, from the input image of the target subject in the neutral emotion present in the training sample, using the graph encoder network;
concatenating (i) the set of emotion-invariant speech embedding features, (ii) the set of first emotion embedding features, and (iii) the set of graph embedding features, to obtain concatenated embedding features of the training sample;
decoding the concatenated embedding features of the training sample, to predict a landmark graph of the training sample, using the graph decoder network, wherein the predicted landmark graph comprises an ordered graph representation of predicted speech and emotion driven geometry-aware facial landmarks of the training sample;
minimizing a loss function of the geometry-aware landmark generation network, wherein the loss function computes a difference between the predicted landmark graph of the training sample, and a ground-truth landmark graph obtained from the ground-truth image corresponding to the training sample; and
updating weights of the geometry-aware landmark generation network, based on the minimization of the loss function of the geometry-aware landmark generation network; and
train a flow-guided texture generation network with each training sample at a time, until the plurality of training samples is completed, to obtain a trained flow-guided texture generation model, using the predicted landmark graph of each training sample, wherein the flow-guided texture generation network comprises an image encoder network, a landmark encoder network, a second emotion encoder network, a feature concatenation encoder-decoder network, and an image decoder network, and wherein training the flow-guided texture generation network with each training sample comprises:
obtaining a set of identity features from the input image of the target subject in the neutral emotion present in the training sample, using the image encoder network;
obtaining a set of differential landmark features, from the predicted landmark graph of the training sample and the neutral landmark graph corresponding to the input image of the target subject in the neutral emotion present in the training sample, using the landmark encoder network;
obtaining a set of second emotion embedding features, from the emotion input data present in the training sample, using the second emotion encoder network;
combining (i) the set of identity features, (ii) the set of differential landmark features, and (iii) the set of second emotion embedding features, to obtain a dense flow map and an occlusion map, for the training sample, using the feature concatenation encoder-decoder network;
passing the dense flow map and the occlusion map for the training sample, to the image decoder network, to predict an emotional talking face image for the target subject present in the training sample, wherein the predicted emotional talking face image corresponds to the speech audio input data and the emotion input data of the training sample;
minimizing a loss function of the flow-guided texture generation network, wherein the loss function of the flow-guided texture generation network computes the difference between the predicted emotional talking face image of the training sample, and the ground-truth image corresponding to the training sample; and
updating weights of the flow-guided texture generation network, based on the minimization of the loss function of the flow-guided texture generation network.
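
The first training block of claim 7 (the geometry-aware landmark generation network) can be pictured as three encoders whose outputs are concatenated and decoded into a landmark graph. The following is a minimal PyTorch sketch of that data flow, assuming a 68-point landmark graph, GRU/MLP encoders, the stated embedding widths, and an L1 landmark loss; none of these choices are fixed by the claim, and the sketch is illustrative rather than the patented implementation.

# Minimal sketch of the claimed geometry-aware landmark generation network.
# Layer sizes, the 68-landmark graph, and the GRU/MLP encoders are assumptions.
import torch
import torch.nn as nn

N_LANDMARKS = 68                 # assumed facial landmark count
AUDIO_DIM, EMO_TYPES = 80, 8     # assumed mel-feature size and emotion classes

class AudioEncoder(nn.Module):
    """Maps a window of speech features to speech embeddings.
    The claim calls these emotion-invariant; how that invariance is enforced
    during training is not modelled here."""
    def __init__(self, dim=128):
        super().__init__()
        self.gru = nn.GRU(AUDIO_DIM, dim, batch_first=True)
    def forward(self, audio):                  # audio: (B, T, AUDIO_DIM)
        _, h = self.gru(audio)
        return h[-1]                           # (B, dim)

class EmotionEncoder(nn.Module):
    """Embeds emotion type (one-hot) and scalar intensity."""
    def __init__(self, dim=32):
        super().__init__()
        self.fc = nn.Linear(EMO_TYPES + 1, dim)
    def forward(self, emo_type, intensity):    # (B, EMO_TYPES), (B, 1)
        return torch.relu(self.fc(torch.cat([emo_type, intensity], dim=-1)))

class GraphEncoder(nn.Module):
    """Encodes the neutral-face landmark graph of the target subject."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(N_LANDMARKS * 2, dim), nn.ReLU())
    def forward(self, landmarks):              # (B, N_LANDMARKS, 2)
        return self.fc(landmarks.flatten(1))

class GraphDecoder(nn.Module):
    """Decodes the concatenated embeddings into a predicted landmark graph."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                nn.Linear(256, N_LANDMARKS * 2))
    def forward(self, z):
        return self.fc(z).view(-1, N_LANDMARKS, 2)

class LandmarkGenerationNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_enc = AudioEncoder()
        self.emo_enc = EmotionEncoder()
        self.graph_enc = GraphEncoder()
        self.decoder = GraphDecoder(128 + 32 + 128)
    def forward(self, audio, emo_type, intensity, neutral_landmarks):
        z = torch.cat([self.audio_enc(audio),
                       self.emo_enc(emo_type, intensity),
                       self.graph_enc(neutral_landmarks)], dim=-1)
        return self.decoder(z)

# One training step per sample, as in the claim: predict the landmark graph,
# compute a loss against the ground-truth landmark graph, update the weights.
if __name__ == "__main__":
    net = LandmarkGenerationNetwork()
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    audio = torch.randn(1, 25, AUDIO_DIM)
    emo_type = torch.nn.functional.one_hot(torch.tensor([3]), EMO_TYPES).float()
    intensity = torch.tensor([[0.7]])
    neutral = torch.rand(1, N_LANDMARKS, 2)
    gt = torch.rand(1, N_LANDMARKS, 2)
    pred = net(audio, emo_type, intensity, neutral)
    loss = torch.nn.functional.l1_loss(pred, gt)
    opt.zero_grad(); loss.backward(); opt.step()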
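The second training block (the flow-guided texture generation network) combines identity features, differential landmark features, and emotion features into a dense flow map and an occlusion map, which the image decoder uses to turn the neutral input image into the emotional talking face. The sketch below assumes a 64x64 image, bilinear warping of the input image via grid_sample, and an L1 reconstruction loss against the ground-truth image; the layer sizes and the specific warping scheme are assumptions, not the claimed design.

# Minimal sketch of the claimed flow-guided texture generation network.
# Resolutions, channel widths, and the grid_sample warping are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG, N_LANDMARKS, EMO_TYPES = 64, 68, 8   # assumed image size and dimensions

class ImageEncoder(nn.Module):
    """Extracts identity features from the neutral input image."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                  nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, dim)
    def forward(self, img):                        # (B, 3, IMG, IMG)
        return self.fc(self.conv(img).flatten(1))

class LandmarkEncoder(nn.Module):
    """Encodes the difference between predicted and neutral landmark graphs."""
    def __init__(self, dim=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(N_LANDMARKS * 2, dim), nn.ReLU())
    def forward(self, pred_lmk, neutral_lmk):      # (B, N_LANDMARKS, 2) each
        return self.fc((pred_lmk - neutral_lmk).flatten(1))

class FlowDecoder(nn.Module):
    """Feature-concatenation encoder-decoder emitting a dense flow map and an occlusion map."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, 256 * 4 * 4)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1))    # 2 flow channels + 1 occlusion channel
    def forward(self, z):
        out = self.up(self.fc(z).view(-1, 256, 4, 4))
        flow = torch.tanh(out[:, :2])              # dense flow in [-1, 1]
        occlusion = torch.sigmoid(out[:, 2:3])     # per-pixel occlusion mask
        return flow, occlusion

class ImageDecoder(nn.Module):
    """Warps the neutral image with the dense flow and refines occluded regions."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(nn.Conv2d(4, 32, 3, 1, 1), nn.ReLU(),
                                    nn.Conv2d(32, 3, 3, 1, 1))
    def forward(self, img, flow, occlusion):
        b = img.size(0)
        theta = torch.eye(2, 3, device=img.device).unsqueeze(0).repeat(b, 1, 1)
        grid = F.affine_grid(theta, img.shape, align_corners=False)   # identity sampling grid
        warped = F.grid_sample(img, grid + flow.permute(0, 2, 3, 1),
                               align_corners=False)
        masked = warped * occlusion                                   # keep visible pixels
        return torch.sigmoid(self.refine(torch.cat([masked, occlusion], dim=1)))

class TextureGenerationNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_enc, self.lmk_enc = ImageEncoder(), LandmarkEncoder()
        self.emo_fc = nn.Linear(EMO_TYPES + 1, 32)                    # second emotion encoder
        self.flow_dec, self.img_dec = FlowDecoder(128 + 64 + 32), ImageDecoder()
    def forward(self, img, pred_lmk, neutral_lmk, emo_type, intensity):
        z = torch.cat([self.img_enc(img),
                       self.lmk_enc(pred_lmk, neutral_lmk),
                       torch.relu(self.emo_fc(torch.cat([emo_type, intensity], -1)))], -1)
        flow, occlusion = self.flow_dec(z)
        return self.img_dec(img, flow, occlusion)

# One training step per sample: predict the emotional talking face image and
# minimise a reconstruction loss against the ground-truth image.
if __name__ == "__main__":
    net = TextureGenerationNetwork()
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    img, gt = torch.rand(1, 3, IMG, IMG), torch.rand(1, 3, IMG, IMG)
    pred_lmk, neutral_lmk = torch.rand(1, N_LANDMARKS, 2), torch.rand(1, N_LANDMARKS, 2)
    emo = F.one_hot(torch.tensor([3]), EMO_TYPES).float()
    intensity = torch.tensor([[0.7]])
    out = net(img, pred_lmk, neutral_lmk, emo, intensity)
    loss = F.l1_loss(out, gt)
    opt.zero_grad(); loss.backward(); opt.step()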