CPC G10L 21/10 (2013.01) [G06N 3/0455 (2023.01); G06N 3/0475 (2023.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 21/06 (2013.01); G10L 25/57 (2013.01); G10L 2015/025 (2013.01)]

12 Claims

1. A video generation method, comprising:
acquiring target audio data to be synthesized;
extracting an acoustic feature of the target audio data as a target acoustic feature;
determining phonetic posteriorgrams (PPG) corresponding to the target audio data according to the target acoustic feature and generating an image sequence corresponding to the target audio data according to the PPG, wherein the PPG is used to characterize a probability distribution of the phoneme to which each audio frame in the target audio data belongs; and
performing video synthesis on the target audio data and the image sequence corresponding to the target audio data to obtain target video data,
wherein determining the PPG corresponding to the target audio data according to the target acoustic feature and generating the image sequence corresponding to the target audio data according to the PPG comprise:
inputting the target acoustic feature into an image generation model, determining, by the image generation model, the PPG corresponding to the target audio data according to the target acoustic feature, and generating the image sequence corresponding to the target audio data according to the PPG;
wherein the image generation model comprises a speech recognition sub-model, a gated recurrent unit (GRU) and a decoding network of a variational autoencoder (VAE) which are connected in sequence; and
wherein the speech recognition sub-model is configured to determine the PPG of the audio data according to an input acoustic feature of the audio data; the GRU is configured to determine a feature vector according to an input PPG; and the decoding network is configured to generate the image sequence corresponding to the audio data according to the feature vector.
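The claim chains three stages: a speech recognition sub-model that turns acoustic features into PPG, a GRU that turns the PPG sequence into per-frame feature vectors, and the decoding network of a VAE that turns those vectors into images. The following is a minimal sketch of that chain in PyTorch; the layer sizes, the MLP stand-ins for the speech recognition sub-model and the VAE decoder, and the 64x64 RGB output are illustrative assumptions, not the patent's implementation.

import torch
import torch.nn as nn

class SpeechRecognitionSubModel(nn.Module):
    """Maps per-frame acoustic features to phonetic posteriorgrams (PPG),
    i.e. a probability distribution over phonemes for each audio frame."""
    def __init__(self, feat_dim=80, num_phonemes=72, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_phonemes),
        )

    def forward(self, feats):               # feats: (batch, frames, feat_dim)
        return self.net(feats).softmax(-1)  # PPG:   (batch, frames, num_phonemes)

class ImageGenerationModel(nn.Module):
    """Speech recognition sub-model -> GRU -> VAE decoding network,
    connected in sequence as recited in the claim."""
    def __init__(self, num_phonemes=72, latent_dim=128, img_dim=64 * 64 * 3):
        super().__init__()
        self.asr = SpeechRecognitionSubModel(num_phonemes=num_phonemes)
        # GRU: PPG sequence -> per-frame feature vector.
        self.gru = nn.GRU(num_phonemes, latent_dim, batch_first=True)
        # VAE decoding network: feature vector -> flattened image.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Sigmoid(),
        )

    def forward(self, feats):
        ppg = self.asr(feats)    # per-frame phoneme posteriors
        z, _ = self.gru(ppg)     # per-frame feature vectors
        return self.decoder(z)   # one image per audio frame

# Usage: one utterance, 100 frames of 80-dimensional acoustic features.
images = ImageGenerationModel()(torch.randn(1, 100, 80))  # (1, 100, 64*64*3)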
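The final step, synthesizing the image sequence and the target audio into target video data, is not tied to any particular tool by the claim. One common way to do it is with ffmpeg; the sketch below assumes the generated images have been written as numbered PNG frames, that the target audio is in audio.wav, and that 25 frames per second matches the audio framing, all of which are illustrative assumptions.

import subprocess

# Mux the generated image sequence with the target audio into one MP4.
subprocess.run(
    [
        "ffmpeg",
        "-framerate", "25",       # one generated image per video frame (assumed rate)
        "-i", "frames/%04d.png",  # image sequence produced by the model
        "-i", "audio.wav",        # the target audio data
        "-c:v", "libx264",        # encode the frames as H.264 video
        "-c:a", "aac",            # encode the audio track as AAC
        "-pix_fmt", "yuv420p",    # widely compatible pixel format
        "-shortest",              # stop at the end of the shorter stream
        "target_video.mp4",
    ],
    check=True,
)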