CPC G06T 13/00 (2013.01) [G06T 9/00 (2013.01); G10L 15/02 (2013.01); G10L 21/055 (2013.01); G10L 25/30 (2013.01)] | 11 Claims |
1. A device for generating a speech moving image that is a computing device comprising one or more processors and a memory storing one or more programs executed by the one or more processors, the device comprising:
a first encoder configured to receive a person background image in which a portion related to speech of a person that is a video part of the speech moving image of the person is covered with a mask, extract an image feature vector from the person background image, and compress the extracted image feature vector;
a second encoder configured to receive a speech audio signal that is an audio part of the speech moving image, extract a voice feature vector from the speech audio signal, and compress the extracted voice feature vector;
a combination unit configured to generate a combination vector by combining the compressed image feature vector output from the first encoder and the compressed voice feature vector output from the second encoder; and
an image reconstruction unit configured to reconstruct the speech moving image of the person with the combination vector as an input,
wherein the first encoder includes a first feature extraction unit that extracts the image feature vector from the person background image and a first compression unit that compresses the extracted image feature vector,
wherein the first compression unit calculates a representative value of an image feature vector for each channel based on the extracted image feature vector, calculates an image representative feature matrix using the representative value of the image feature vector for each channel as each matrix element, and controls a compressed size of the image feature vector by connecting a fully connected neural network to the image representative feature matrix, and
wherein the representative value is a mean value of the image feature vector for each channel; and
the first compression unit calculates the mean value of the image feature vector for each channel through Equation 1 below:
$$f_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{i,j,c}$$
where $f_c$: Mean value of an image feature vector of a c-th channel;
$H$: Height of the image feature vector;
$W$: Width of the image feature vector; and
$F_{i,j,c}$: Image feature vector value of the c-th channel at (i, j) coordinates.
|
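The compression step recited for the first compression unit (a per-channel mean followed by a fully connected layer) can be illustrated with a minimal sketch. This is not the patented implementation: the function names `channel_means` and `fully_connected`, the `F[i][j][c]` list layout, and the weight and bias values are all hypothetical and chosen only for illustration.

```python
from typing import List

# Feature map indexed as F[i][j][c], i.e. height x width x channels (assumed layout).
FeatureMap = List[List[List[float]]]


def channel_means(F: FeatureMap) -> List[float]:
    """Per-channel mean over all (i, j) positions, as described for Equation 1."""
    H = len(F)
    W = len(F[0])
    C = len(F[0][0])
    sums = [0.0] * C
    for row in F:
        for pixel in row:
            for c in range(C):
                sums[c] += pixel[c]
    return [s / (H * W) for s in sums]


def fully_connected(x: List[float],
                    weights: List[List[float]],
                    bias: List[float]) -> List[float]:
    """Plain affine layer y = Wx + b; the number of weight rows sets the output size."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


# Toy feature map with H = 2, W = 2, C = 3 (illustrative values only).
F = [
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    [[5.0, 6.0, 7.0], [7.0, 8.0, 9.0]],
]

means = channel_means(F)  # one representative value per channel

# Hypothetical 2 x 3 weights: compress 3 channel means down to 2 values.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]

compressed = fully_connected(means, weights, bias)
print(means)       # [4.0, 5.0, 6.0]
print(compressed)  # [4.0, 11.5]
```

In this sketch the compressed size is controlled solely by the number of rows in `weights`, which mirrors the claim's statement that the fully connected network controls the compressed size of the image feature vector.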