| CPC G11B 27/036 (2013.01) [G06T 7/00 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/30196 (2013.01)] | 20 Claims |

|
1. A computing system comprising a processor and a non-transitory computer-readable medium having stored thereon program instructions that, upon execution by the processor, cause performance of a set of acts comprising:
estimating, using a skeletal detection model, a pose of an original actor for each of multiple frames of a video;
obtaining, for each of a plurality of the estimated poses of the original actor, a respective image of a modified version of the original actor;
generating, using the estimated poses and the images of the modified version of the original actor, synthetic frames corresponding to the multiple frames of the video that depict the modified version of the original actor in place of the original actor, wherein the synthetic frames depict the modified version of the original actor in respective poses that align with the estimated poses of the original actor in corresponding frames of the video, and wherein the synthetic frames comprise facial expressions for the modified version of the original actor that temporally align with corresponding speech, wherein generating the synthetic frames comprises, for a given frame of the multiple frames, inserting, using an object insertion model, an image of the modified version of the original actor into the given frame at a location indicated by the estimated pose of the original actor so as to obtain a modified frame, and wherein generating the synthetic frames further comprises providing the corresponding speech and the modified frame as input to a temporal generative adversarial network having an ensemble of discriminators; and
combining the synthetic frames and the corresponding speech so as to obtain a synthetic video that replaces the original actor with the modified version of the original actor.
|
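
The acts recited in claim 1 can be read as a per-frame data flow: estimate a pose, obtain an image of the modified actor for that pose, insert it at the location the pose indicates, refine the result together with the corresponding speech, and pair the output with that speech. The Python sketch below illustrates only that ordering and is not the patented implementation. Every name in it (`replace_actor`, `pose_anchor`, `SyntheticFrame`, and the four callables) is hypothetical, the models are passed in as placeholder functions, and three simplifications are assumed: one speech chunk per frame, an image obtained for every pose, and the top-left corner of the pose's bounding box as the insertion location. The ensemble of discriminators belongs to training the temporal generative adversarial network and is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Keypoints = List[Tuple[float, float]]  # (x, y) joints from a skeletal model
Location = Tuple[float, float]


@dataclass
class SyntheticFrame:
    index: int
    location: Location   # where the modified actor was inserted
    pixels: str          # placeholder for the generated image data
    speech_chunk: str    # speech segment aligned with this frame


def pose_anchor(pose: Keypoints) -> Location:
    """Top-left corner of the pose's bounding box (an assumed convention)."""
    xs, ys = zip(*pose)
    return (min(xs), min(ys))


def replace_actor(
    frames: Sequence[str],
    speech: Sequence[str],
    estimate_pose: Callable[[str], Keypoints],            # skeletal detection model
    obtain_image: Callable[[Keypoints], str],             # image of the modified actor
    insert_object: Callable[[str, str, Location], str],   # object insertion model
    temporal_gan: Callable[[str, str], str],              # generator: (speech, frame)
) -> List[SyntheticFrame]:
    """Run the claimed acts in order for each frame (hypothetical sketch)."""
    if len(frames) != len(speech):
        raise ValueError("expected one speech chunk per frame")
    synthetic_video = []
    for i, (frame, chunk) in enumerate(zip(frames, speech)):
        pose = estimate_pose(frame)                   # act 1: estimate the pose
        image = obtain_image(pose)                    # act 2: image for that pose
        location = pose_anchor(pose)
        modified = insert_object(frame, image, location)  # act 3a: modified frame
        pixels = temporal_gan(chunk, modified)        # act 3b: speech + modified frame
        # act 4: keep the synthetic frame paired with its speech
        synthetic_video.append(SyntheticFrame(i, location, pixels, chunk))
    return synthetic_video


# Toy stand-ins so the flow can be exercised without any real models.
demo = replace_actor(
    frames=["f0", "f1"],
    speech=["s0", "s1"],
    estimate_pose=lambda frame: [(10.0, 20.0), (30.0, 5.0)],
    obtain_image=lambda pose: "actor_v2",
    insert_object=lambda frame, img, loc: f"{frame}+{img}@{loc}",
    temporal_gan=lambda chunk, frame: f"gan({chunk},{frame})",
)
```

Passing the models in as callables keeps the sketch independent of any particular pose estimator, insertion model, or generator, none of which the claim names.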