US 12,260,882 B2
Actor-replacement system for videos
Sunil Ramesh, Cupertino, CA (US); Michael Cutter, Golden, CO (US); and Karina Levitian, Austin, TX (US)
Assigned to Roku, Inc., San Jose, CA (US)
Filed by Roku, Inc., San Jose, CA (US)
Filed on May 16, 2024, as Appl. No. 18/666,243.
Application 18/666,243 is a continuation of application No. 18/349,551, filed on Jul. 10, 2023, granted, now 12,014,753.
Application 18/349,551 is a continuation of application No. 18/062,410, filed on Dec. 6, 2022, granted, now 11,749,311, issued on Sep. 5, 2023.
Prior Publication US 2024/0304219 A1, Sep. 12, 2024
Int. Cl. G11B 27/00 (2006.01); G06T 7/00 (2017.01); G11B 27/036 (2006.01); G06T 7/70 (2017.01)

CPC G11B 27/036 (2013.01) [G06T 7/00 (2013.01); G06T 2207/10016 (2013.01); G06T 2207/30196 (2013.01)]

20 Claims

1. A computing system comprising a processor and a non-transitory computer-readable medium having stored thereon program instructions that upon execution by the processor, cause performance of a set of acts comprising:

estimating, using a skeletal detection model, a pose of an original actor for each of multiple frames of a video;

obtaining, for each of a plurality of the estimated poses of the original actor, a respective image of a modified version of the original actor;

generating, using the estimated poses and the images of the modified version of the original actor, synthetic frames corresponding to the multiple frames of the video that depict the modified version of the original actor in place of the original actor, wherein the synthetic frames depict the modified version of the original actor in respective poses that align with the estimated poses of the original actor in corresponding frames of the video, and wherein the synthetic frames comprise facial expressions for the modified version of the original actor that temporally align with corresponding speech, wherein generating the synthetic frames comprises, for a given frame of the multiple frames, inserting, using an object insertion model, an image of the modified version of the original actor into the given frame at a location indicated by the estimated pose of the original actor so as to obtain a modified frame, and wherein generating the synthetic frames further comprises providing the corresponding speech and the modified frame as input to a temporal generative adversarial network having an ensemble of discriminators; and

combining the synthetic frames and the corresponding speech so as to obtain a synthetic video that replaces the original actor with the modified version of the original actor.
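
The order of the acts recited in claim 1 can be illustrated with a minimal data-flow sketch. It is not the patented implementation. Every name below (`replace_actor`, `SyntheticVideo`, and the `toy_*` stand-ins) is hypothetical. The skeletal detection model, the object insertion model, and the temporal generative adversarial network are passed in as plain callables, and the toy versions only tag dictionaries so the flow can be followed. The sketch also assumes one speech segment per frame, which the claim does not state.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple

Pose = Tuple[int, int]    # assumption: a pose reduced to a single (x, y) location
Frame = Dict[str, Any]    # assumption: a frame modeled as a plain dictionary


@dataclass
class SyntheticVideo:
    """Hypothetical container pairing synthetic frames with their speech."""
    frames: List[Frame]
    speech: List[str]


def replace_actor(
    frames: Sequence[Frame],
    speech: Sequence[str],
    estimate_pose: Callable[[Frame], Pose],
    get_actor_image: Callable[[Pose], str],
    insert_object: Callable[[Frame, str, Pose], Frame],
    temporal_gan: Callable[[str, Frame], Frame],
) -> SyntheticVideo:
    """Chain the four acts of claim 1 over caller-supplied model callables."""
    if len(frames) != len(speech):
        raise ValueError("this sketch assumes one speech segment per frame")
    # Act 1: estimate the original actor's pose in each frame.
    poses = [estimate_pose(f) for f in frames]
    # Act 2: obtain an image of the modified actor for each estimated pose.
    images = [get_actor_image(p) for p in poses]
    # Act 3a: insert each image at the location indicated by the pose.
    modified = [insert_object(f, i, p) for f, i, p in zip(frames, images, poses)]
    # Act 3b: give the speech and the modified frame to the GAN stand-in.
    synthetic = [temporal_gan(s, m) for s, m in zip(speech, modified)]
    # Act 4: combine the synthetic frames with the corresponding speech.
    return SyntheticVideo(frames=synthetic, speech=list(speech))


# Toy stand-ins (hypothetical): they only label data, they are not real models.
def toy_estimate_pose(frame: Frame) -> Pose:
    return frame["actor_at"]


def toy_get_actor_image(pose: Pose) -> str:
    return f"modified_actor@{pose}"


def toy_insert_object(frame: Frame, image: str, pose: Pose) -> Frame:
    return {**frame, "actor": image, "actor_at": pose}


def toy_temporal_gan(speech_segment: str, modified_frame: Frame) -> Frame:
    return {**modified_frame, "expression_for": speech_segment}


demo_frames = [
    {"actor": "original", "actor_at": (10, 20)},
    {"actor": "original", "actor_at": (12, 20)},
]
demo_speech = ["he", "llo"]
demo_video = replace_actor(
    demo_frames, demo_speech,
    toy_estimate_pose, toy_get_actor_image, toy_insert_object, toy_temporal_gan,
)
```

In this sketch, each synthetic frame keeps the location estimated from the original frame and carries the speech segment it was generated with, mirroring the pose-alignment and temporal-alignment limitations of the claim. The ensemble of discriminators is used when training such a network and is therefore not modeled here.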