US 12,002,139 B2
Robust facial animation from video using neural networks
Inaki Navarro, Zurich (CH); Dario Kneubuhler, Zurich (CH); Tijmen Verhulsdonck, Gothenburg (SE); Eloi Du Bois, Austin, TX (US); Will Welch, San Francisco, CA (US); Vivek Verma, Oakland, CA (US); Ian Sachs, Corte Madera, CA (US); and Kiran Bhat, San Francisco, CA (US)
Assigned to Roblox Corporation, San Mateo, CA (US)
Filed by Roblox Corporation, San Mateo, CA (US)
Filed on Feb. 22, 2022, as Appl. No. 17/677,123.
Claims priority of provisional application 63/152,819, filed on Feb. 23, 2021.
Claims priority of provisional application 63/152,327, filed on Feb. 22, 2021.
Prior Publication US 2022/0270314 A1, Aug. 25, 2022
Int. Cl. G06T 13/40 (2011.01); G06T 7/73 (2017.01); G06V 10/24 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 40/16 (2022.01)

CPC G06T 13/40 (2013.01) [G06T 7/73 (2017.01); G06V 10/24 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 40/161 (2022.01); G06V 40/171 (2022.01); G06V 40/174 (2022.01); G06T 2207/20081 (2013.01); G06T 2207/20084 (2013.01); G06T 2207/30201 (2013.01)]

20 Claims

1. A computer-implemented method, comprising:

identifying, using a fully convolutional network, a set of bounding box candidates from a first frame of a video, wherein each bounding box candidate includes a face;

refining, using a convolutional neural network, the set of bounding box candidates into a bounding box;

obtaining a first set of one or more of a predefined facial expression weight, a head pose, and facial landmarks based on the bounding box and the first frame using an overloaded output convolutional neural network;

generating a first animation frame of an animation of a three-dimensional (3D) avatar based on the first set of the one or more of the predefined facial expression weight, the head pose, and the facial landmarks, wherein a head pose of the avatar matches the head pose in the first set and facial landmarks of the avatar match the facial landmarks in the first set; and

for each additional frame of the video subsequent to the first frame,

detecting whether the bounding box, applied to the additional frame, includes the face;

if it is detected that the bounding box includes the face, bypassing the fully convolutional network and the convolutional neural network, and obtaining an additional set of the one or more of the predefined facial expression weight, the head pose, and the facial landmarks based on the bounding box and the additional frame, by using the overloaded output convolutional neural network; and

generating an additional animation frame of the animation of the 3D avatar using the additional set.
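Claim 1 leaves its two detection stages abstract. The sketch below is a rough illustration of the post-processing such a cascade commonly uses; it is not the patented networks.

- **Candidate generation:** every above-threshold cell of a dense face-score map, of the kind a fully convolutional network produces, is mapped back to a candidate box.
- **Reduction:** the candidates are reduced to a single bounding box. The claim assigns this step to a convolutional neural network. Here, plain greedy non-maximum suppression (NMS) plus top-score selection, as used in MTCNN-style cascades, is swapped in as a stand-in.

The score-map layout and the `stride`, `window`, and threshold parameters are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in frame pixels
Scored = Tuple[Box, float]               # (box, face score)


def candidates_from_score_map(
    score_map: Sequence[Sequence[float]], stride: float, window: float, threshold: float
) -> List[Scored]:
    """Map every score-map cell at or above `threshold` to a square candidate box.

    ASSUMPTION: cell (r, c) covers a `window`-sized square whose top-left corner
    sits at (c * stride, r * stride) in the input frame.
    """
    candidates: List[Scored] = []
    for r, row in enumerate(score_map):
        for c, score in enumerate(row):
            if score >= threshold:
                x0, y0 = c * stride, r * stride
                candidates.append(((x0, y0, x0 + window, y0 + window), score))
    return candidates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(candidates: List[Scored], iou_threshold: float = 0.3) -> List[Scored]:
    """Greedy NMS: visit boxes by descending score, drop any that overlap a kept box."""
    kept: List[Scored] = []
    for box, score in sorted(candidates, key=lambda bs: bs[1], reverse=True):
        if all(iou(box, k) < iou_threshold for k, _ in kept):
            kept.append((box, score))
    return kept


def select_face_box(candidates: List[Scored]) -> Optional[Box]:
    """Stand-in for the refinement CNN: the best surviving box, or None if no face."""
    kept = nms(candidates)
    return kept[0][0] if kept else None
```

For a 3x3 score map `[[0.1, 0.9, 0.8], [0.0, 0.2, 0.1], [0.0, 0.0, 0.95]]` with `stride=4`, `window=12`, and `threshold=0.5`, three candidates survive thresholding. NMS suppresses one of them, and the box from the 0.95 cell, `(8, 8, 20, 20)`, is selected.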
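The term "overloaded output" suggests a single network whose one output vector carries several quantities at once. Decoding such a vector might look like the following. The sketch assumes a hypothetical flat layout:

- K expression weights;
- then three head-pose angles (yaw, pitch, roll);
- then L landmark (x, y) pairs normalized to the bounding box.

The layout, the clamping of weights to [0, 1], and the box-relative landmark coordinates are illustrative assumptions, not details from the claim.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in frame pixels


@dataclass
class FaceParams:
    expression_weights: List[float]         # one weight per predefined expression
    head_pose: Tuple[float, float, float]   # (yaw, pitch, roll); units assumed radians
    landmarks: List[Tuple[float, float]]    # landmark positions in frame pixels


def decode_overloaded_output(
    raw: Sequence[float], num_expressions: int, num_landmarks: int, box: Box
) -> FaceParams:
    """Split one flat network output into expression weights, head pose, and landmarks.

    ASSUMED layout: [K weights | yaw, pitch, roll | x0, y0, x1, y1, ...], with
    landmark coordinates normalized to the bounding box (0..1 on each axis).
    """
    expected = num_expressions + 3 + 2 * num_landmarks
    if len(raw) != expected:
        raise ValueError(f"expected {expected} values, got {len(raw)}")
    k = num_expressions
    # Clamp expression weights into [0, 1] (an assumed convention for blendshape rigs).
    weights = [min(1.0, max(0.0, w)) for w in raw[:k]]
    yaw, pitch, roll = raw[k : k + 3]
    coords = raw[k + 3 :]
    # Map box-normalized landmark coordinates back into frame pixels.
    x0, y0, x1, y1 = box
    bw, bh = x1 - x0, y1 - y0
    landmarks = [
        (x0 + coords[2 * i] * bw, y0 + coords[2 * i + 1] * bh) for i in range(num_landmarks)
    ]
    return FaceParams(weights, (yaw, pitch, roll), landmarks)
```

Design note: packing all heads into one output lets one forward pass per frame produce every quantity the avatar needs. That fits the claim's per-frame reuse of the same network.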
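To make the animation step concrete, the sketch below assumes two things:

- The predefined facial expression weights drive a linear delta-blendshape rig, where each weight scales a per-vertex offset from a neutral mesh.
- The head pose is a yaw/pitch/roll rotation applied afterwards, in the order pitch about x, then yaw about y, then roll about z.

The claim specifies neither the rig model nor the rotation order. Head translation and the landmark-matching constraint are omitted for brevity.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def apply_expression(
    neutral: Sequence[Vec3], deltas: Sequence[Sequence[Vec3]], weights: Sequence[float]
) -> List[Vec3]:
    """ASSUMED linear delta-blendshape model: v_i = n_i + sum_k w_k * d_k[i]."""
    if len(deltas) != len(weights):
        raise ValueError("need exactly one delta shape per expression weight")
    out: List[Vec3] = []
    for i, (x, y, z) in enumerate(neutral):
        for w, d in zip(weights, deltas):
            dx, dy, dz = d[i]
            x, y, z = x + w * dx, y + w * dy, z + w * dz
        out.append((x, y, z))
    return out


def rotate(v: Vec3, yaw: float, pitch: float, roll: float) -> Vec3:
    """Rotate a point by pitch (about x), then yaw (about y), then roll (about z)."""
    x, y, z = v
    c, s = math.cos(pitch), math.sin(pitch)
    y, z = y * c - z * s, y * s + z * c
    c, s = math.cos(yaw), math.sin(yaw)
    x, z = x * c + z * s, -x * s + z * c
    c, s = math.cos(roll), math.sin(roll)
    x, y = x * c - y * s, x * s + y * c
    return (x, y, z)


def pose_avatar(
    neutral: Sequence[Vec3],
    deltas: Sequence[Sequence[Vec3]],
    weights: Sequence[float],
    head_pose: Tuple[float, float, float],
) -> List[Vec3]:
    """One avatar frame: blend the expression, then apply the (yaw, pitch, roll) head pose."""
    yaw, pitch, roll = head_pose
    return [rotate(v, yaw, pitch, roll) for v in apply_expression(neutral, deltas, weights)]
```

For example, a single vertex at `(1, 0, 0)` with a "jaw" delta of `(0, -1, 0)` at weight 0.5 moves to `(1, -0.5, 0)`. A 90-degree yaw then carries it to approximately `(0, -0.5, -1)`.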
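The per-frame logic of claim 1 amounts to a tracker. It runs the two-stage detector on the first frame, then reuses the bounding box, bypassing both detection networks, for as long as the box still contains the face. A minimal sketch follows.

The five callables are hypothetical hooks standing in for the networks, the face check, and the renderer. Claim 1 does not recite what happens when the face check fails. Re-running detection, and skipping frames where no face is found, are assumptions.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def animate_video(
    frames: Sequence[Any],
    propose: Callable[[Any], List[Box]],                # hypothetical: FCN candidate stage
    refine: Callable[[Any, List[Box]], Optional[Box]],  # hypothetical: CNN refinement stage
    regress: Callable[[Any, Box], Any],                 # hypothetical: overloaded-output CNN
    contains_face: Callable[[Any, Box], bool],          # hypothetical: face check on reused box
    render: Callable[[Any], Any],                       # hypothetical: builds one avatar frame
) -> Tuple[List[Any], int]:
    """Return (animation_frames, number_of_times_the_two_stage_detector_ran)."""
    animation: List[Any] = []
    detector_runs = 0
    box: Optional[Box] = None
    for frame in frames:
        reuse = box is not None and contains_face(frame, box)
        if not reuse:
            # First frame, or (ASSUMPTION) the face left the box: run both detection stages.
            detector_runs += 1
            box = refine(frame, propose(frame))
            if box is None:
                continue  # ASSUMPTION: no face detected, so no animation frame is emitted
        # Bypass path rejoins here: only the overloaded-output network runs per frame.
        animation.append(render(regress(frame, box)))
    return animation, detector_runs
```

With toy hooks in which each frame is just a face x-position, the frames `[10, 11, 12, 50, 51]` trigger the detector twice. It runs once on the first frame and once when the face jumps outside the reused box; the other three frames take the bypass path.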