US 11,928,767 B2
Method for audio-driven character lip sync, model for audio-driven character lip sync and training method therefor
Huapeng Sima, Jiangsu (CN); and Zheng Liao, Jiangsu (CN)
Assigned to NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD., Jiangsu (CN)
Filed by NANJING SILICON INTELLIGENCE TECHNOLOGY CO., LTD., Jiangsu (CN)
Filed on Jun. 21, 2023, as Appl. No. 18/212,545.
Claims priority of application No. 202210953133.5 (CN), filed on Aug. 10, 2022.
Prior Publication US 2024/0054711 A1, Feb. 15, 2024
Int. Cl. G06T 13/40 (2011.01); G06T 13/20 (2011.01); G06T 13/80 (2011.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 10/98 (2022.01); G06V 40/20 (2022.01); G10L 21/06 (2013.01)
CPC G06T 13/40 (2013.01) [G06T 13/205 (2013.01); G06T 13/80 (2013.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 10/98 (2022.01); G06V 40/20 (2022.01); G10L 21/06 (2013.01)] 9 Claims
OG exemplary drawing
 
1. A system model on a cloud for audio-driven character lip sync, wherein the model on the cloud is configured to:
extract an audio feature of target speech and an image feature of a target image, wherein the target speech indicates speech for a target dynamic image to be generated for a target character, the target image indicates a lip sync image for the target dynamic image to be generated for the target character, and the target dynamic image indicates a video in which the target image performs lip sync actions corresponding to the target speech;
encode the audio feature and the image feature to obtain audio encoded data and image encoded data, respectively;
stitch the audio encoded data and the image encoded data to obtain image-audio data;
perform synthesis processing on the image-audio data to obtain a dynamic image encoding result;
decode the dynamic image encoding result to obtain the target dynamic image; and
generate, during a training process of the model, a silent video of preset duration based on a sample image, process the silent video as sample image data, and process an auxiliary video as auxiliary data, wherein the sample image indicates a lip sync image of the target character; the auxiliary video comprises a non-target lip sync action produced when a non-target character speaks and non-target speech corresponding to that non-target lip sync action; and the sample image data and the auxiliary data are used for training the model.
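
The claim's extract-encode-stitch-synthesize-decode data flow can be illustrated with a minimal sketch, assuming a PyTorch-style implementation. The patent does not disclose concrete architectures, so every module, layer, and dimension below (audio encoder, image encoder, synthesis network, frame decoder, feature sizes) is a hypothetical placeholder; only the ordering of operations follows the claim.

    import torch
    import torch.nn as nn

    class LipSyncModel(nn.Module):
        # Hypothetical sketch of the claimed pipeline; architectures and
        # dimensions are illustrative assumptions, not disclosed by the patent.
        def __init__(self, audio_dim=80, img_channels=3, feat_dim=256):
            super().__init__()
            # Encode the audio feature of the target speech (e.g., a mel-spectrogram).
            self.audio_encoder = nn.Sequential(
                nn.Conv1d(audio_dim, feat_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            # Encode the image feature of the target (lip sync) image.
            self.image_encoder = nn.Sequential(
                nn.Conv2d(img_channels, feat_dim, kernel_size=7, stride=2, padding=3),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            # Perform synthesis processing on the stitched image-audio data.
            self.synthesis = nn.Linear(2 * feat_dim, feat_dim)
            # Decode the dynamic image encoding result into an output frame.
            self.decoder = nn.Linear(feat_dim, img_channels * 96 * 96)

        def forward(self, audio_feat, target_image):
            # audio_feat: (B, audio_dim, T); target_image: (B, C, H, W)
            a = self.audio_encoder(audio_feat).squeeze(-1)   # audio encoded data
            v = self.image_encoder(target_image).flatten(1)  # image encoded data
            av = torch.cat([a, v], dim=1)                    # stitch -> image-audio data
            code = torch.relu(self.synthesis(av))            # dynamic image encoding result
            frame = self.decoder(code).view(-1, 3, 96, 96)   # decode -> dynamic image frame
            return frame

Concatenating the two encodings mirrors the claim's "stitch" step; a production system would more plausibly run this per audio window to emit a sequence of frames rather than a single image.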
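
The final claim element prepares training data by turning a single still image into a silent video of preset duration. Below is a hedged sketch of that step, assuming OpenCV and assuming the silent video is simply the still frame repeated at a fixed frame rate; the duration, frame rate, codec, and function name are illustrative choices, not values from the patent.

    import cv2

    def make_silent_video(sample_image_path: str, out_path: str,
                          duration_s: float = 5.0, fps: float = 25.0) -> None:
        # Load the sample image (a lip sync image of the target character).
        frame = cv2.imread(sample_image_path)
        if frame is None:
            raise FileNotFoundError(sample_image_path)
        h, w = frame.shape[:2]
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                 fps, (w, h))
        # Repeat the still frame for the preset duration to form the silent video.
        for _ in range(int(duration_s * fps)):
            writer.write(frame)
        writer.release()

The resulting silent clip would then be processed into sample image data and paired with the auxiliary video (a non-target character's speech and matching lip motion) for training, as the claim recites.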