US 12,236,943 B2
Apparatus and method for generating lip sync image
Guem Buel Hwang, Seoul (KR); and Gyeong Su Chae, Seoul (KR)
Assigned to DEEPBRAIN AI INC., Seoul (KR)
Appl. No. 17/764,651
Filed by DEEPBRAIN AI INC., Seoul (KR)
PCT Filed Jun. 8, 2021, PCT No. PCT/KR2021/007125
§ 371(c)(1), (2) Date Mar. 29, 2022,
PCT Pub. No. WO2022/149667, PCT Pub. Date Jul. 14, 2022.
Claims priority of application No. 10-2021-0003375 (KR), filed on Jan. 11, 2021.
Prior Publication US 2023/0178072 A1, Jun. 8, 2023
Int. Cl. G10L 21/10 (2013.01); G10L 15/16 (2006.01); G10L 15/25 (2013.01)
CPC G10L 15/16 (2013.01) [G10L 21/10 (2013.01); G10L 15/25 (2013.01); G10L 2021/105 (2013.01)] 8 Claims
OG exemplary drawing
 
1. An apparatus for generating a lip sync image, the apparatus having one or more processors and a memory which stores one or more programs executed by the one or more processors, the apparatus comprising:
a first artificial neural network model configured to generate an utterance match synthesis image by using a person background image and an utterance match audio signal as an input, and generate an utterance mismatch synthesis image by using the person background image and an utterance mismatch audio signal as an input; and
a second artificial neural network model configured to receive, as an input, an input pair in which an image and a voice match or an input pair in which an image and a voice do not match, and to output a classification value for the input pair,
wherein the utterance match audio signal is a voice signal which matches the figure of the corresponding person uttering in the person background image,
wherein the utterance mismatch audio signal is a voice signal which does not match the figure of the corresponding person uttering in the person background image,
wherein the second artificial neural network model is trained to classify the input pair in which an image and a voice match as True, and to classify the input pair in which an image and a voice do not match as False,
wherein the second artificial neural network model is configured to receive the utterance mismatch synthesis image generated by the first artificial neural network model, together with the utterance mismatch audio signal used as the input when generating the utterance mismatch synthesis image, to classify the utterance mismatch synthesis image and the utterance mismatch audio signal as True, and to propagate a generative adversarial error to the first artificial neural network model through an adversarial learning method.
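
To make the claimed architecture concrete, the following is a minimal PyTorch sketch of the first artificial neural network model (the generator). Every class name, layer size, and the assumption that the audio signal arrives as a fixed-length feature vector (e.g., pooled MFCCs) are illustrative choices, not details taken from the patent.

```python
# Hypothetical sketch of the first artificial neural network model (generator).
# Shapes and layers are assumptions for illustration only.
import torch
import torch.nn as nn

class LipSyncGenerator(nn.Module):
    """Generates a synthesis image from a person background image and an audio signal."""
    def __init__(self, audio_dim: int = 128):
        super().__init__()
        # Image encoder: downsamples the person background image (assumed 3x128x128).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
        )
        # Audio encoder: maps an audio feature vector to a spatial feature map.
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 64 * 32 * 32), nn.ReLU(),
        )
        # Decoder: fuses image and audio features and upsamples to an image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 64
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 64 -> 128
        )

    def forward(self, background_image: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(background_image)            # (B, 64, 32, 32)
        aud_feat = self.audio_encoder(audio).view(-1, 64, 32, 32)  # (B, 64, 32, 32)
        fused = torch.cat([img_feat, aud_feat], dim=1)             # (B, 128, 32, 32)
        return self.decoder(fused)                                 # synthesis image
```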
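A matching sketch, under the same assumptions, of the second artificial neural network model (a sync discriminator): it encodes the image and the voice of an input pair separately and emits one logit per pair, where a high value reads as True (match) and a low value as False (mismatch).

```python
# Hypothetical sketch of the second artificial neural network model
# (sync discriminator). Architecture is an assumption, not from the patent.
import torch
import torch.nn as nn

class SyncDiscriminator(nn.Module):
    """Outputs a classification value for an (image, voice) input pair:
    a logit above 0 leans True (match), below 0 leans False (mismatch)."""
    def __init__(self, audio_dim: int = 128, embed_dim: int = 256):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),  # one logit per input pair
        )

    def forward(self, image: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([self.image_encoder(image), self.audio_encoder(audio)], dim=1)
        return self.classifier(pair)
```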
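Finally, a hypothetical training step tying the two sketches above together. The claim specifies only that matching pairs are classified as True, mismatching pairs as False, and that a generative adversarial error propagates to the first model; the binary cross-entropy loss, Adam optimizers, and batch layout below are assumptions layered on top of that.

```python
# Hypothetical adversarial training step for the claimed scheme.
# Reuses the LipSyncGenerator and SyncDiscriminator sketches above.
import torch
import torch.nn.functional as F

generator = LipSyncGenerator()
discriminator = SyncDiscriminator()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def train_step(background, match_image, match_audio, mismatch_audio):
    # Discriminator step: matching (image, voice) pairs are labeled True (1),
    # non-matching pairs False (0).
    real_logit = discriminator(match_image, match_audio)
    fake_logit = discriminator(match_image, mismatch_audio)
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
        + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    )
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: the utterance mismatch synthesis image, paired with the
    # mismatch audio used to generate it, should be classified as True; the
    # resulting generative adversarial error propagates to the generator.
    synth = generator(background, mismatch_audio)
    adv_logit = discriminator(synth, mismatch_audio)
    g_loss = F.binary_cross_entropy_with_logits(adv_logit, torch.ones_like(adv_logit))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```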