US 12,205,212 B2
Method and device for generating speech moving image
Gyeongsu Chae, Seoul (KR); and Guembuel Hwang, Seoul (KR)
Assigned to DEEPBRAIN AI INC., Seoul (KR)
Appl. No. 17/762,914
Filed by DEEPBRAIN AI INC., Seoul (KR)
PCT Filed Dec. 8, 2020, PCT No. PCT/KR2020/017847 § 371(c)(1), (2) Date Mar. 23, 2022, PCT Pub. No. WO2022/014800, PCT Pub. Date Jan. 20, 2022.
Claims priority of application No. 10-2020-0086183 (KR), filed on Jul. 13, 2020.
Prior Publication US 2022/0398793 A1, Dec. 15, 2022
Int. Cl. G06T 13/00 (2011.01); G06T 9/00 (2006.01); G10L 15/02 (2006.01); G10L 21/055 (2013.01); G10L 25/30 (2013.01)

CPC G06T 13/00 (2013.01) [G06T 9/00 (2013.01); G10L 15/02 (2013.01); G10L 21/055 (2013.01); G10L 25/30 (2013.01)]

11 Claims

1. A device for generating a speech moving image that is a computing device comprising one or more processors and a memory storing one or more programs executed by the one or more processors, the device comprising:

a first encoder configured to receive a person background image in which a portion related to speech of a person that is a video part of the speech moving image of the person is covered with a mask, extract an image feature vector from the person background image, and compress the extracted image feature vector;

a second encoder configured to receive a speech audio signal that is an audio part of the speech moving image, extract a voice feature vector from the speech audio signal, and compress the extracted voice feature vector;

a combination unit configured to generate a combination vector by combining the compressed image feature vector output from the first encoder and the compressed voice feature vector output from the second encoder; and

an image reconstruction unit configured to reconstruct the speech moving image of the person with the combination vector as an input,

wherein the first encoder includes a first feature extraction unit that extracts the image feature vector from the person background image and a first compression unit that compresses the extracted image feature vector,

wherein the first compression unit calculates a representative value of an image feature vector for each channel based on the extracted image feature vector, calculates an image representative feature matrix using the representative value of the image feature vector for each channel as each matrix element, and controls a compressed size of the image feature vector by connecting a fully connected neural network to the image representative feature matrix, and

wherein the representative value is a mean value of the image feature vector for each channel; and

the first compression unit calculates the mean value of the image feature vector for each channel through Equation 1 below:

$$f_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{i,j,c} \qquad \text{(Equation 1)}$$

where f_c: Mean value of an image feature vector of a c-th channel;

H: Height of the image feature vector;

W: Width of the image feature vector; and

F_i,j,c: Image feature vector value of the c-th channel at (i, j) coordinates.
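
The compression and combination steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the (H, W, C) array layout, the single linear layer standing in for the fully connected neural network, the use of concatenation for "combining", and all function and variable names (`channel_means`, `compress`, `combine`) are assumptions made only for illustration.

```python
import numpy as np


def channel_means(feature_map):
    """Per-channel mean over height and width, as described for Equation 1.

    feature_map is assumed to have shape (H, W, C); the result has shape (C,).
    """
    h, w, _ = feature_map.shape
    return feature_map.sum(axis=(0, 1)) / (h * w)


def compress(feature_map, weights, bias):
    """Per-channel means followed by one fully connected (linear) layer.

    weights has shape (C, D); D sets the compressed size. A single linear
    layer is an assumption standing in for the claimed network.
    """
    return channel_means(feature_map) @ weights + bias


def combine(image_vec, voice_vec):
    """Build a combination vector.

    Concatenation is an assumption; the claim only says "combining".
    """
    return np.concatenate([image_vec, voice_vec])


# Illustrative sizes and random data, not values from the patent.
rng = np.random.default_rng(0)
H, W, C, D = 4, 6, 8, 3
image_features = rng.normal(size=(H, W, C))  # stand-in for extracted image features
weights = rng.normal(size=(C, D))
bias = np.zeros(D)

compressed_image = compress(image_features, weights, bias)
compressed_voice = rng.normal(size=D)  # stand-in for the second encoder's output
combination = combine(compressed_image, compressed_voice)
```

In this sketch the compressed size depends only on D, the output width of the linear layer, and not on H or W, which matches the claim's statement that the fully connected network controls the compressed size.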