CPC G10L 21/10 (2013.01) [G06N 3/0455 (2023.01); G06N 3/0475 (2023.01); G10L 15/02 (2013.01); G10L 15/063 (2013.01); G10L 15/16 (2013.01); G10L 21/06 (2013.01); G10L 25/57 (2013.01); G10L 2015/025 (2013.01)]

12 Claims

1. A video generation method, comprising:
acquiring target audio data to be synthesized;
extracting an acoustic feature of the target audio data as a target acoustic feature;
determining phonetic posteriorgrams (PPG) corresponding to the target audio data according to the target acoustic feature and generating an image sequence corresponding to the target audio data according to the PPG, wherein the PPG is used to characterize a probability distribution of the phoneme to which each audio frame in the target audio data belongs; and
performing video synthesis on the target audio data and the image sequence corresponding to the target audio data to obtain target video data,
wherein determining the PPG corresponding to the target audio data according to the target acoustic feature and generating the image sequence corresponding to the target audio data according to the PPG comprise:
inputting the target acoustic feature into an image generation model, determining, by the image generation model, the PPG corresponding to the target audio data according to the target acoustic feature, and generating the image sequence corresponding to the target audio data according to the PPG;
wherein the image generation model comprises a speech recognition sub-model, a gated recurrent unit (GRU) and a decoding network of a variational autoencoder (VAE) which are connected in sequence; and
wherein the speech recognition sub-model is configured to determine the PPG of the audio data according to an input acoustic feature of the audio data; the GRU is configured to determine a feature vector according to an input PPG; and the decoding network is configured to generate the image sequence corresponding to the audio data according to the feature vector.
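The claim chains three stages: a speech recognition sub-model that turns acoustic features into PPG, a GRU that turns the PPG sequence into per-frame feature vectors, and the decoding network of a VAE that turns those vectors into images. The following is a minimal sketch of that chain in PyTorch; the layer sizes, the MLP stand-ins for the speech recognition sub-model and the VAE decoder, and the 64x64 RGB output are illustrative assumptions, not the patent's implementation.

import torch
import torch.nn as nn

class SpeechRecognitionSubModel(nn.Module):
    """Maps per-frame acoustic features to phonetic posteriorgrams (PPG),
    i.e. a probability distribution over phonemes for each audio frame."""
    def __init__(self, feat_dim=80, num_phonemes=72, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_phonemes),
        )

    def forward(self, feats):               # feats: (batch, frames, feat_dim)
        return self.net(feats).softmax(-1)  # PPG:   (batch, frames, num_phonemes)

class ImageGenerationModel(nn.Module):
    """Speech recognition sub-model -> GRU -> VAE decoding network,
    connected in sequence as recited in the claim."""
    def __init__(self, num_phonemes=72, latent_dim=128, img_dim=64 * 64 * 3):
        super().__init__()
        self.asr = SpeechRecognitionSubModel(num_phonemes=num_phonemes)
        # GRU: PPG sequence -> per-frame feature vector.
        self.gru = nn.GRU(num_phonemes, latent_dim, batch_first=True)
        # VAE decoding network: feature vector -> flattened image.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Sigmoid(),
        )

    def forward(self, feats):
        ppg = self.asr(feats)    # per-frame phoneme posteriors
        z, _ = self.gru(ppg)     # per-frame feature vectors
        return self.decoder(z)   # one image per audio frame

# Usage: one utterance, 100 frames of 80-dimensional acoustic features.
images = ImageGenerationModel()(torch.randn(1, 100, 80))  # (1, 100, 64*64*3)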
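The final step, synthesizing the image sequence and the target audio into target video data, is not tied to any particular tool by the claim. One common way to do it is with ffmpeg; the sketch below assumes the generated images have been written as numbered PNG frames, that the target audio is in audio.wav, and that 25 frames per second matches the audio framing, all of which are illustrative assumptions.

import subprocess

# Mux the generated image sequence with the target audio into one MP4.
subprocess.run(
    [
        "ffmpeg",
        "-framerate", "25",       # one generated image per video frame (assumed rate)
        "-i", "frames/%04d.png",  # image sequence produced by the model
        "-i", "audio.wav",        # the target audio data
        "-c:v", "libx264",        # encode the frames as H.264 video
        "-c:a", "aac",            # encode the audio track as AAC
        "-pix_fmt", "yuv420p",    # widely compatible pixel format
        "-shortest",              # stop at the end of the shorter stream
        "target_video.mp4",
    ],
    check=True,
)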