US 12,301,847 B2
Hierarchical video encoders
Vihan Jain, San Francisco, CA (US); Joonseok Lee, Fremont, CA (US); Ming Zhao, Sunnyvale, CA (US); Sheide Chammas, San Francisco, CA (US); Hexiang Hu, Los Angeles, CA (US); Bowen Zhang, Los Angeles, CA (US); Fei Sha, Los Angeles, CA (US); and Tze Way Eugene Ie, Los Altos, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Dec. 5, 2023, as Appl. No. 18/529,173.
Application 18/070,556 is a division of application No. 17/162,150, filed on Jan. 29, 2021, granted, now 11,533,495, issued on Dec. 20, 2022.
Application 18/529,173 is a continuation of application No. 18/070,556, filed on Nov. 29, 2022, granted, now 11,876,986.
Prior Publication US 2024/0114158 A1, Apr. 4, 2024
Int. Cl. H04N 19/30 (2014.01); G06N 20/00 (2019.01); H04N 19/172 (2014.01); H04N 19/20 (2014.01); H04N 19/40 (2014.01)
CPC H04N 19/30 (2014.11) [G06N 20/00 (2019.01); H04N 19/172 (2014.11)] 10 Claims
 
1. A computing system, the system comprising:
one or more processors;
one or more non-transitory computer readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising:
obtaining a training dataset, wherein the training dataset comprises a search query, a ground-truth video, and a negative video-query pair, wherein the ground-truth video is responsive to the search query;
processing the ground-truth video with a machine-learned hierarchical video encoder model to generate a plurality of contextualized segment representations, wherein each contextualized segment representation of the plurality of contextualized segment representations comprises segment-level semantic information for a respective video segment;
determining a first video-query compatibility score based on the search query and the plurality of contextualized segment representations;
determining a second video-query compatibility score based on a respective video representation and respective query of the negative video-query pair;
evaluating a loss function that evaluates a difference between the first video-query compatibility score and the second video-query compatibility score; and
adjusting one or more parameters of the machine-learned hierarchical video encoder model based at least in part on the loss function.
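
The operations recited in claim 1 can be illustrated with a minimal sketch. The claim does not specify the encoder architecture, the scoring function, or the form of the loss, so everything concrete below is an assumption made only for illustration: the toy `encode` function (a stand-in for the machine-learned hierarchical video encoder model), the dot-product `compatibility` score over mean-pooled segment representations, the hinge (margin ranking) loss, and the `margin` and `lr` parameters. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def encode(weights: Vector, segments: Sequence[Vector]) -> List[Vector]:
    """Toy stand-in for the hierarchical encoder (assumption, not the claimed model).

    Each segment is blended with the video-level mean so that it carries
    context, then scaled element-wise by the learnable weights.
    """
    context = mean_pool(segments)
    return [
        [w * 0.5 * (s + c) for w, s, c in zip(weights, seg, context)]
        for seg in segments
    ]


def compatibility(query: Vector, segment_reps: Sequence[Vector]) -> float:
    """Assumed score: dot product of the query with mean-pooled segment reps."""
    pooled = mean_pool(segment_reps)
    return sum(q * p for q, p in zip(query, pooled))


def train_step(
    weights: Vector,
    pos_query: Vector,
    pos_segments: Sequence[Vector],
    neg_query: Vector,
    neg_segments: Sequence[Vector],
    margin: float = 1.0,
    lr: float = 0.1,
) -> Tuple[Vector, float]:
    """One update: score both pairs, evaluate a hinge loss, adjust the weights."""
    s_pos = compatibility(pos_query, encode(weights, pos_segments))
    s_neg = compatibility(neg_query, encode(weights, neg_segments))
    # Hinge loss on the score difference (assumed form of the loss).
    loss = max(0.0, margin - (s_pos - s_neg))
    if loss == 0.0:
        return list(weights), loss
    # For this toy encoder, d(score)/d(w_i) = query_i * mean(raw segments)_i.
    pos_grad = [q * m for q, m in zip(pos_query, mean_pool(pos_segments))]
    neg_grad = [q * m for q, m in zip(neg_query, mean_pool(neg_segments))]
    # While the hinge is active, d(loss)/d(w_i) = -(pos_grad_i - neg_grad_i),
    # so gradient descent adds lr * (pos_grad_i - neg_grad_i).
    new_weights = [
        w + lr * (pg - ng) for w, pg, ng in zip(weights, pos_grad, neg_grad)
    ]
    return new_weights, loss


if __name__ == "__main__":
    weights = [1.0, 1.0]
    query = [1.0, 0.0]
    ground_truth = [[1.0, 0.0], [0.5, 0.0]]  # segments of the responsive video
    negative = [[0.0, 1.0], [0.2, 1.0]]      # segments of the negative video
    for step in range(3):
        weights, loss = train_step(weights, query, ground_truth, query, negative)
        print(step, loss)
```

In this sketch the negative pair reuses the same query with a non-responsive video; the claim only requires that the second score be based on the video representation and query of the negative video-query pair, so a different query could be passed instead.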