| CPC G10L 17/14 (2013.01) [G06N 3/08 (2013.01); G10L 17/02 (2013.01); G10L 21/10 (2013.01); H04N 5/265 (2013.01); H04N 21/2368 (2013.01); H04N 21/439 (2013.01); G10L 2021/105 (2013.01); G10L 25/30 (2013.01)] | 7 Claims |

|
1. A device for generating a speech video having one or more processors and a memory storing one or more programs executable by the one or more processors, the one or more processors are configured to:
using a first encoder, receive a person background image corresponding to a video part of a speech video of a person and extract an image feature vector from the person background image;
using a second encoder, receive a speech audio signal corresponding to an audio part of the speech video and extract a voice feature vector from the speech audio signal;
using a combiner, generate a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder; and
using a decoder, reconstruct the speech video of the person using the combined vector as an input,
wherein the image feature vector is a 3-dimensional vector in a form of height×width×channel, and the voice feature vector is a 1-dimensional vector in a form of channel,
wherein the one or more processors are further configured to, using the combiner:
transform the voice feature vector into a tensor having the same form as the image feature vector by copying the voice feature vector by the height of the image feature vector in a height direction and by copying the voice feature vector by the width of the image feature vector in a width direction, and
generate the combined vector by concatenating the image feature vector and the voice feature vector having the same form as the image feature vector.
|
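The combiner steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the feature tensors are plain nested Python lists and that the concatenation is performed along the channel axis, which the claim does not state explicitly. The function names `tile_voice_vector` and `combine` and the toy values are hypothetical and chosen only for illustration.

```python
def tile_voice_vector(voice, height, width):
    """Copy a 1-D voice feature vector (channel,) to every spatial
    position, giving a height x width x channel tensor."""
    return [[list(voice) for _ in range(width)] for _ in range(height)]


def combine(image, voice):
    """Concatenate an image feature tensor (height x width x channel)
    with the tiled voice tensor.

    Assumption: concatenation is along the channel axis."""
    height = len(image)
    width = len(image[0])
    tiled = tile_voice_vector(voice, height, width)
    return [
        [image[h][w] + tiled[h][w] for w in range(width)]
        for h in range(height)
    ]


# Toy inputs: a 2 x 2 x 3 image feature tensor and a 2-channel voice vector.
image = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]
voice = [0.5, -0.5]

combined = combine(image, voice)
shape = (len(combined), len(combined[0]), len(combined[0][0]))
print(shape)  # (2, 2, 5)
```

Under these assumptions, every spatial position of the combined tensor carries the image channels followed by the same copy of the voice channels, so a decoder operating on it sees the audio information at each location of the image feature map.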