US 12,315,059 B2
Method for generating a talking head video with mouth movement sequence, device and computer-readable storage medium
Wan Ding, Shenzhen (CN); Dongyan Huang, Shenzhen (CN); Linhuang Yan, Shenzhen (CN); and Zhiyong Yang, Shenzhen (CN)
Assigned to UBTECH ROBOTICS CORP LTD, Shenzhen (CN)
Filed by UBTECH ROBOTICS CORP LTD, Shenzhen (CN)
Filed on May 26, 2023, as Appl. No. 18/202,291.
Claims priority of application No. 202210612090.4 (CN), filed on May 31, 2022.
Prior Publication US 2023/0386116 A1, Nov. 30, 2023
Int. Cl. G06T 13/20 (2011.01); G06T 13/40 (2011.01); G06V 40/20 (2022.01); G10L 13/02 (2013.01); G10L 13/08 (2013.01); G10L 21/10 (2013.01)
CPC G06T 13/40 (2013.01) [G06T 13/205 (2013.01); G06V 40/20 (2022.01); G10L 13/02 (2013.01); G10L 13/08 (2013.01); G10L 2021/105 (2013.01)] 20 Claims
OG exemplary drawing
 
1. A computer-implemented method for generating a talking head video with mouth movement sequence, the method comprising:
obtaining a text and an image containing a face of a user;
determining a phoneme sequence that corresponds to the text and comprises one or more phonemes;
determining acoustic features corresponding to the text according to the phoneme sequence, and obtaining synthesized speech corresponding to the text according to the acoustic features;
determining a first mouth movement sequence corresponding to the text according to the phoneme sequence, and determining a second mouth movement sequence corresponding to the text according to the acoustic features;
creating a facial action video corresponding to the user according to the first mouth movement sequence, the second mouth movement sequence and the image; and
processing the synthesized speech and the facial action video synchronously, and generating a talking head video corresponding to the user;
wherein the method further comprises, before determining the second mouth movement sequence corresponding to the text according to the acoustic features,
obtaining a video data set that comprises multiple pieces of video data;
determining a training phoneme sequence corresponding to each of the multiple pieces of video data;
obtaining acoustic features corresponding to the video data according to the training phoneme sequences, and determining a second initial mouth movement sequence corresponding to the acoustic features;
determining a first mouth movement sequence corresponding to the training phoneme sequences;
obtaining a second training mouth movement sequence corresponding to the acoustic features according to the second initial mouth movement sequence and the first mouth movement sequence corresponding to the training phoneme sequences; and
obtaining a second prediction model using each of the acoustic features and the second training mouth movement sequences corresponding to the acoustic features, wherein the second prediction model is configured to predict the second mouth movement sequence according to the acoustic features.
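
The inference steps of claim 1 (text to phoneme sequence, acoustic features, synthesized speech, two mouth movement sequences, facial action video, synchronized output) can be read as a single pipeline. The following is a minimal numpy sketch of that pipeline under stated assumptions: every helper (text_to_phonemes, acoustic_model, vocoder, mouth_from_phonemes, mouth_from_acoustics) is a hypothetical stand-in for the real components, the feature and mouth-parameter dimensions are arbitrary, and the linear second prediction model is an assumption; the claim does not fix any model class or renderer.

```python
import numpy as np

FRAME_DIM = 80          # assumed acoustic feature dimension (e.g. mel bins)
MOUTH_DIM = 3           # assumed mouth parameters (e.g. jaw, lip width, lip height)
FRAMES_PER_PHONEME = 4  # assumed upsampling ratio from phoneme rate to frame rate

def text_to_phonemes(text: str) -> list[str]:
    # Stand-in G2P front end: treats each letter as a phoneme.
    return [c for c in text.lower() if c.isalpha()]

def acoustic_model(phonemes: list[str]) -> np.ndarray:
    # Stand-in acoustic model: deterministic pseudo-features,
    # one block of frames per phoneme.
    rng = np.random.default_rng(sum(map(ord, phonemes)))
    return rng.standard_normal((len(phonemes) * FRAMES_PER_PHONEME, FRAME_DIM))

def vocoder(features: np.ndarray) -> np.ndarray:
    # Stand-in vocoder: collapses each acoustic frame to one sample.
    return features.mean(axis=1)

def mouth_from_phonemes(phonemes: list[str]) -> np.ndarray:
    # First mouth movement sequence: coarse viseme-like values at phoneme rate.
    return np.array([[(ord(p) % 10) / 10.0] * MOUTH_DIM for p in phonemes])

def mouth_from_acoustics(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    # Second mouth movement sequence: frame-rate prediction by the
    # "second prediction model" (a linear map in this sketch).
    return features @ weights

def generate_talking_head(text: str, face_image: np.ndarray, weights: np.ndarray):
    phonemes = text_to_phonemes(text)               # phoneme sequence
    features = acoustic_model(phonemes)             # acoustic features
    speech = vocoder(features)                      # synthesized speech
    seq1 = mouth_from_phonemes(phonemes)            # first mouth movement sequence
    seq2 = mouth_from_acoustics(features, weights)  # second mouth movement sequence
    # Stand-in renderer: a real system would warp the face image frame by
    # frame using both sequences; here frames are plain copies of the image.
    video = np.repeat(face_image[None, ...], len(seq2), axis=0)
    return speech, video                            # audio and frames, muxed in sync
```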
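The wherein-clause describes how the second prediction model is trained from a video data set: derive a training phoneme sequence per clip, obtain acoustic features, combine the second initial mouth movement sequence with the phoneme-derived first sequence to form training targets, and fit the model. Reusing the stand-ins above, this is a hypothetical sketch; the transcripts, the noise-based initial sequences, and the least-squares fit are all assumptions, since in the claim the initial sequences come from the video data itself.

```python
def train_second_model(transcripts: list[str]) -> np.ndarray:
    # `transcripts` stands in for the multiple pieces of video data; a real
    # pipeline extracts phoneme sequences and acoustic features from each clip
    # and tracks mouth landmarks as the second *initial* mouth movement sequence.
    X, Y = [], []
    for i, transcript in enumerate(transcripts):
        phonemes = text_to_phonemes(transcript)  # training phoneme sequence
        features = acoustic_model(phonemes)      # acoustic features of the clip
        # Second initial mouth movement sequence (faked: small frame-rate noise).
        rng = np.random.default_rng(i)
        initial = 0.1 * rng.standard_normal((len(features), MOUTH_DIM))
        # First mouth movement sequence, upsampled to frame rate and combined
        # with the initial sequence to form the second *training* sequence.
        seq1 = np.repeat(mouth_from_phonemes(phonemes), FRAMES_PER_PHONEME, axis=0)
        Y.append(seq1 + initial)                 # second training mouth sequence
        X.append(features)
    X, Y = np.vstack(X), np.vstack(Y)
    # Fit the linear "second prediction model" by least squares: features -> mouth.
    weights, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return weights

# Usage: train on a toy data set, then drive the inference pipeline above.
weights = train_second_model(["hello world", "open the door"])
speech, video = generate_talking_head("hi there", np.zeros((64, 64, 3)), weights)
print(speech.shape, video.shape)  # e.g. (28,) and (28, 64, 64, 3)
```

The two-sequence design is the notable choice here: the phoneme-derived sequence supplies articulation that is stable across speakers, while the acoustics-derived sequence adds frame-rate timing and intensity, and the training step ties the second model's targets back to the first sequence so the two agree when combined at render time.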