CPC G06T 7/73 (2017.01) [G06F 16/685 (2019.01); G06F 40/242 (2020.01); G06T 7/207 (2017.01)]
17 Claims

1. A method comprising:
receiving a first input including a reference speech video, the reference speech video including a reference video sequence paired with a reference audio sequence;
generating a video motion graph representing the reference speech video, wherein each node of the video motion graph is associated with a frame of the reference video sequence and reference audio features of the reference audio sequence;
receiving a second input including a target audio sequence;
identifying a node path through the video motion graph based on target audio features of the target audio sequence and the reference audio features; and
generating an output media sequence, the output media sequence including an output video sequence generated based on the identified node path through the video motion graph and paired with the target audio sequence, the generating including blending, by a trained neural network, frames of the reference video sequence associated with one or more nodes surrounding pairs of consecutive nodes in the identified node path that are non-consecutive nodes in the video motion graph.
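For illustration only, a minimal sketch of the claimed pipeline is given below, assuming per-frame audio feature vectors and a simple greedy node-path search. The names VideoMotionGraph, find_node_path, and blend_frames are hypothetical, the frame-similarity metric and thresholds are assumptions, and the cross-fade stands in for the trained neural network blending recited in the claim; this is not the patented implementation.

```python
# Sketch, not the patented method: graph nodes pair frames with audio features,
# a path is chosen against target audio features, and non-consecutive jumps are blended.
import numpy as np

class VideoMotionGraph:
    """Each node holds one reference video frame and its reference audio features."""
    def __init__(self, frames, ref_audio_feats, sim_threshold=0.9):
        self.frames = frames                      # list of H x W x 3 arrays
        self.audio = np.asarray(ref_audio_feats)  # (N, D) reference audio features
        n = len(frames)
        # Consecutive edges follow the original reference video order.
        self.edges = {i: {i + 1} for i in range(n - 1)}
        self.edges[n - 1] = set()
        # Non-consecutive edges connect visually similar frames (assumed cosine metric).
        flat = np.stack([f.reshape(-1).astype(np.float32) for f in frames])
        flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8
        sim = flat @ flat.T
        for i in range(n):
            for j in range(n):
                if abs(i - j) > 1 and sim[i, j] > sim_threshold:
                    self.edges[i].add(j)

def find_node_path(graph, target_audio_feats):
    """Greedy search: at each target step, move to the neighboring node whose
    reference audio features best match the current target audio features."""
    path = [0]
    for t_feat in np.asarray(target_audio_feats):
        cur = path[-1]
        candidates = list(graph.edges[cur]) or [cur]
        costs = [np.linalg.norm(graph.audio[c] - t_feat) for c in candidates]
        path.append(candidates[int(np.argmin(costs))])
    return path

def blend_frames(graph, path, blend_fn=None):
    """Emit output frames; where consecutive path nodes are non-consecutive in the
    graph, blend the surrounding frames (cross-fade stands in for the trained network)."""
    out = []
    for a, b in zip(path, path[1:]):
        if abs(a - b) == 1:
            out.append(graph.frames[b])
        else:
            fa = graph.frames[a].astype(np.float32)
            fb = graph.frames[b].astype(np.float32)
            blended = blend_fn(fa, fb) if blend_fn else 0.5 * (fa + fb)
            out.append(blended.astype(graph.frames[0].dtype))
    return out
```

In this sketch, the output video sequence is the list returned by blend_frames, which would then be paired with the target audio sequence to form the output media sequence.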