| CPC G06T 13/40 (2013.01) [G06T 7/246 (2017.01); G06T 7/262 (2017.01); G06T 11/206 (2013.01); G06T 13/205 (2013.01); G10L 15/02 (2013.01); G06T 2207/20076 (2013.01); G06T 2207/30201 (2013.01)] | 12 Claims |

|
1. An apparatus for generating a speech synthesis image based on machine learning, the apparatus comprising:
at least one processor configured to implement:
a first global geometric transformation predictor configured to be trained to receive each of a source image and a target image including the same person, and predict a global geometric transformation for a global motion of the person between the source image and the target image, based on the source image and the target image;
a local feature tensor predictor configured to be trained to predict a feature tensor for a local motion of the person, based on preset input data; and
an image generator configured to be trained to reconstruct the target image, based on the global geometric transformation, the source image, and the feature tensor for the local motion,
wherein the local feature tensor predictor includes a first local feature tensor predictor configured to be trained to predict a speech feature tensor for a local speech motion of the person, based on a preset voice signal, and
the local speech motion is a motion related to speech of the person.
|
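The data flow recited in claim 1 can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in and is not taken from the patent: the function names, the use of a centroid-based integer translation as the "global geometric transformation", and the mean absolute amplitude as the "speech feature tensor" are all assumptions chosen only to make the example runnable. The claimed components are trained machine-learning models; these toy functions are not.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]       # toy grayscale image: rows of pixel values
Translation = Tuple[int, int]   # toy global geometric transformation (dx, dy)
Feature = List[float]           # toy stand-in for a feature tensor


def centroid(img: Image) -> Tuple[float, float]:
    """Intensity-weighted centroid (x, y); assumes a non-zero image."""
    total = sum(sum(row) for row in img)
    cx = sum(x * v for row in img for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(img) for v in row) / total
    return cx, cy


def predict_global_transform(source: Image, target: Image) -> Translation:
    """Hypothetical global predictor: takes BOTH images, returns a shift."""
    sx, sy = centroid(source)
    tx, ty = centroid(target)
    return round(tx - sx), round(ty - sy)


def predict_speech_feature(voice: Sequence[float]) -> Feature:
    """Hypothetical local predictor: voice signal -> one-element feature."""
    return [sum(abs(s) for s in voice) / len(voice)]


def generate_image(transform: Translation, source: Image,
                   feature: Feature) -> Image:
    """Hypothetical generator: applies only the global shift to the source.

    A trained generator would also condition on `feature` to render the
    local speech motion; this toy accepts it but does not use it.
    """
    dx, dy = transform
    h, w = len(source), len(source[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = source[y][x]
    return out


def reconstruct(source: Image, target: Image,
                voice: Sequence[float]) -> Image:
    """Wire the three components together in the order the claim recites."""
    transform = predict_global_transform(source, target)
    feature = predict_speech_feature(voice)
    return generate_image(transform, source, feature)
```

In this sketch the global predictor sees the source and target images, the local predictor sees only the voice signal, and the generator combines the transformation, the source image, and the feature. This mirrors the separation of global motion from local speech motion in the claim, without implying anything about the actual network architectures or training procedure.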