US 12,462,460 B2
Photorealistic talking faces from audio
Vivek Kwatra, Saratoga, CA (US); Christian Frueh, Mountain View, CA (US); Avisek Lahiri, West Bengal (IN); and John Lewis, Mountain View, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jun. 5, 2024, as Appl. No. 18/734,327.
Application 18/734,327 is a continuation of application No. 17/796,399, granted, now 12,033,259, previously published as PCT/US2021/015698, filed on Jan. 29, 2021.
Claims priority of provisional application 62/967,335, filed on Jan. 29, 2020.
Prior Publication US 2024/0320892 A1, Sep. 26, 2024
Int. Cl. G06T 13/20 (2011.01); G06T 13/40 (2011.01); G06T 17/20 (2006.01)
CPC G06T 13/205 (2013.01) [G06T 13/40 (2013.01); G06T 17/20 (2013.01)]
20 Claims
1. A computing system to generate a talking face from an audio signal, the computing system comprising:
one or more processors; and
one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:
obtaining audio data descriptive of audio signals comprising speech;
processing the audio data with a machine-learned face geometry prediction model to predict a set of face geometries based on the audio data, wherein the machine-learned face geometry prediction model was trained to predict three-dimensional face geometries based on data descriptive of input audio signals comprising speech;
processing the audio data with a machine-learned face texture prediction model to predict a set of face textures based on the audio data, wherein the machine-learned face texture prediction model was trained to predict two-dimensional face textures based on data descriptive of input audio signals comprising speech; and
generating a synthesized video based on the audio data, the set of face geometries, and the set of face textures, wherein the synthesized video comprises a face performing movements associated with the speech of the audio data based on the set of face geometries and the set of face textures.
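
The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names `split_into_windows`, `toy_geometry_model`, `toy_texture_model`, `synthesize_video`, and `Frame` are hypothetical, the two "models" are hand-written stand-ins rather than machine-learned models, and the per-frame windowing is an assumption. The final rendering of each geometry and texture pair into pixels is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vertex = Tuple[float, float, float]   # one 3D mesh vertex (x, y, z)
Texture = List[List[float]]           # 2D grid of intensities


@dataclass
class Frame:
    """One video frame before rendering: a 3D geometry plus a 2D texture."""
    geometry: List[Vertex]
    texture: Texture


def split_into_windows(samples: Sequence[float], window: int) -> List[Sequence[float]]:
    """Chop audio samples into non-overlapping windows, one per video frame."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, window)]


def mean_energy(window: Sequence[float]) -> float:
    """Average absolute amplitude of one audio window."""
    return sum(abs(s) for s in window) / len(window)


def toy_geometry_model(window: Sequence[float]) -> List[Vertex]:
    """Stand-in for a trained geometry model (assumption, not from the patent).

    Louder audio moves the lower vertex down, loosely mimicking an opening mouth.
    """
    return [(0.0, 1.0, 0.0), (0.0, -1.0 - mean_energy(window), 0.0)]


def toy_texture_model(window: Sequence[float]) -> Texture:
    """Stand-in for a trained texture model: a 2x2 texture driven by audio energy."""
    e = mean_energy(window)
    return [[e, e], [e, e]]


def synthesize_video(
    samples: Sequence[float],
    window: int,
    geometry_model: Callable[[Sequence[float]], List[Vertex]],
    texture_model: Callable[[Sequence[float]], Texture],
) -> List[Frame]:
    """Run both models on the same audio and pair their outputs frame by frame."""
    return [Frame(geometry_model(w), texture_model(w))
            for w in split_into_windows(samples, window)]


frames = synthesize_video([0.0, 0.0, 0.5, -0.5], 2,
                          toy_geometry_model, toy_texture_model)
```

The point of the sketch is only the structure: the same audio data feeds two separate predictors, one producing three-dimensional geometry and one producing two-dimensional texture, and the synthesized video is built from both sets of outputs.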