CPC G06T 13/205 (2013.01) [G06T 7/20 (2013.01); G06T 7/70 (2017.01); G06T 13/40 (2013.01); G06V 40/176 (2022.01); G10L 15/22 (2013.01); G10L 25/57 (2013.01); G10L 25/60 (2013.01); G06T 2207/30201 (2013.01)]    19 Claims

1. A computer-implemented method, the method comprising:
receiving, from a microphone of a device, first audio data that includes a representation of first speech of a first user;
receiving, from an image sensor of the device, first image data representing a face of the first user;
generating, using the first image data, first motion data representing first facial motion of the first user corresponding to the first speech;
generating, by a machine learning transformer component using the first audio data and the first motion data, first embedding data that represents the first facial motion, wherein the first embedding data corresponds to a first coordinate system;
determining, using a first identifier representing a listener style, second embedding data corresponding to the first coordinate system;
generating, by a first machine learning model using the first embedding data and the second embedding data, first animation data corresponding to second facial motion responsive to the first speech;
generating, using the first animation data, second image data representing a synthetic face engaging in the second facial motion; and
presenting, on a display of the device, the second image data.
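
For readers who want a concrete picture of the pipeline recited in claim 1, the sketch below is one plausible realization in PyTorch: per-frame audio features and facial-motion features from the speaking user are fused by a transformer encoder into a first embedding, a listener-style identifier is looked up in an embedding table sharing the same coordinate system (the same dimensionality), and a small decoder maps the pair to listener facial-animation parameters. Every module name, dimension, and layer choice here is an illustrative assumption; the claim covers the functional steps, not any particular architecture.

    import torch
    import torch.nn as nn

    class ListenerAnimationModel(nn.Module):
        """Hypothetical sketch of the claimed flow. All dimensions and
        layer choices are assumptions, not taken from the specification."""

        def __init__(self, audio_dim=80, motion_dim=52, embed_dim=256,
                     num_styles=16, anim_dim=52):
            super().__init__()
            # Project per-frame audio features (e.g., log-mel) and facial
            # motion features (e.g., blendshape coefficients) into a
            # shared model dimension before the transformer.
            self.audio_proj = nn.Linear(audio_dim, embed_dim)
            self.motion_proj = nn.Linear(motion_dim, embed_dim)
            # "machine learning transformer component": fuses the two
            # modalities into the first embedding data.
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            # "first identifier representing a listener style" maps to
            # second embedding data in the same coordinate system
            # (same embed_dim as the speaker embedding).
            self.style_table = nn.Embedding(num_styles, embed_dim)
            # "first machine learning model": decodes the speaker and
            # style embeddings into listener animation data.
            self.decoder = nn.GRU(embed_dim * 2, embed_dim, batch_first=True)
            self.anim_head = nn.Linear(embed_dim, anim_dim)

        def forward(self, audio_feats, motion_feats, style_id):
            # audio_feats:  (batch, frames, audio_dim)
            # motion_feats: (batch, frames, motion_dim)
            # style_id:     (batch,) integer listener-style identifier
            x = self.audio_proj(audio_feats) + self.motion_proj(motion_feats)
            speaker_embed = self.encoder(x)           # first embedding data
            style_embed = self.style_table(style_id)  # second embedding data
            style_seq = style_embed.unsqueeze(1).expand_as(speaker_embed)
            fused = torch.cat([speaker_embed, style_seq], dim=-1)
            out, _ = self.decoder(fused)
            return self.anim_head(out)                # first animation data

    # Toy usage with random tensors standing in for real features.
    model = ListenerAnimationModel()
    audio = torch.randn(1, 120, 80)    # ~120 frames of audio features
    motion = torch.randn(1, 120, 52)   # matching facial-motion coefficients
    style = torch.tensor([3])          # listener-style identifier
    anim = model(audio, motion, style)
    print(anim.shape)                  # torch.Size([1, 120, 52])

Giving the style table the same dimensionality as the transformer output is one simple way to realize the claim's requirement that the second embedding data "corresponds to the first coordinate system"; the final rendering step (second image data on the display) would be handled by a separate synthesis stage not shown here.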