US 12,148,431 B2
Method and device for generating speech video using audio signal
Gyeongsu Chae, Seoul (KR); Guembuel Hwang, Seoul (KR); Sungwoo Park, Seoul (KR); and Seyoung Jang, Seoul (KR)
Assigned to DEEPBRAIN AI INC., Seoul (KR)
Appl. No. 17/620,867
Filed by DEEPBRAIN AI INC., Seoul (KR)
PCT Filed Jun. 19, 2020, PCT No. PCT/KR2020/007975 § 371(c)(1), (2) Date Apr. 20, 2022, PCT Pub. No. WO2020/256472, PCT Pub. Date Dec. 24, 2020.
Claims priority of application No. 10-2019-0074150 (KR), filed on Jun. 21, 2019; and application No. 10-2020-0070748 (KR), filed on Jun. 11, 2020.
Prior Publication US 2022/0399025 A1, Dec. 15, 2022
Int. Cl. G10L 17/14 (2013.01); G06N 3/08 (2023.01); G10L 17/02 (2013.01); G10L 21/10 (2013.01); H04N 5/265 (2006.01); H04N 21/2368 (2011.01); H04N 21/439 (2011.01); G10L 25/30 (2013.01)

CPC G10L 17/14 (2013.01) [G06N 3/08 (2013.01); G10L 17/02 (2013.01); G10L 21/10 (2013.01); H04N 5/265 (2013.01); H04N 21/2368 (2013.01); H04N 21/439 (2013.01); G10L 2021/105 (2013.01); G10L 25/30 (2013.01)]

7 Claims

1. A device for generating a speech video having one or more processors and a memory storing one or more programs executable by the one or more processors, the one or more processors are configured to:

using a first encoder, receive a person background image corresponding to a video part of a speech video of a person and extract an image feature vector from the person background image;

using a second encoder, receive a speech audio signal corresponding to an audio part of the speech video and extract a voice feature vector from the speech audio signal;

using a combiner, generate a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; and

using a decoder, reconstruct the speech video of the person using the combined vector as an input,

wherein the image feature vector is a 3-dimensional vector in a form of height×width×channel, and the voice feature vector is a 1-dimensional vector in a form of channel,

wherein the one or more processors are further configured to, using the combiner:

transform the voice feature vector into a tensor having the same form as the image feature vector by copying the voice feature vector by the height of the image feature vector in a height direction and by copying the voice feature vector by the width of the image feature vector in a width direction, and

generate the combined vector by concatenating the image feature vector and the voice feature vector having the same form as the image feature vector.
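
The combiner step recited above can be illustrated with a minimal sketch. It is not the patentee's implementation: the function names `tile_voice_feature` and `combine`, the use of nested Python lists as stand-ins for tensors, the toy sizes, and the choice to concatenate along the channel axis are all assumptions made for illustration. The claim itself only states that the voice feature vector is copied along the height and width directions and then concatenated with the image feature vector.

```python
from typing import List

Vector = List[float]
Tensor3D = List[List[List[float]]]  # indexed as [height][width][channel]


def tile_voice_feature(voice: Vector, height: int, width: int) -> Tensor3D:
    """Copy a 1-D voice feature vector to every (row, column) position.

    The result has the height x width x channel form of the image features,
    with the voice channels repeated at each spatial location.
    """
    return [[list(voice) for _ in range(width)] for _ in range(height)]


def combine(image: Tensor3D, voice: Vector) -> Tensor3D:
    """Concatenate image features and tiled voice features.

    Assumption: concatenation is along the channel axis, so the output has
    (image channels + voice channels) channels at each spatial position.
    """
    height, width = len(image), len(image[0])
    tiled = tile_voice_feature(voice, height, width)
    return [
        [image[h][w] + tiled[h][w] for w in range(width)]
        for h in range(height)
    ]


# Toy sizes (hypothetical): a 2 x 3 image feature map with 4 channels
# and a 2-channel voice feature vector.
image_features = [[[float(h * 3 + w)] * 4 for w in range(3)] for h in range(2)]
voice_features = [0.5, -0.5]
combined = combine(image_features, voice_features)
```

Under these assumptions the combined tensor keeps the spatial size of the image features (2 x 3), and each position carries the 4 image channels followed by the 2 voice channels. In an array library the same effect is commonly obtained by broadcasting or tiling the vector before concatenating.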