US 12,393,839 B2
Video anchors
Gabe Culbertson, Palo Alto, CA (US); Wei Peng, Fremont, CA (US); and Nicolas Crowell, San Francisco, CA (US)
Assigned to GOOGLE LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Jun. 14, 2023, as Appl. No. 18/334,648.
Application 18/334,648 is a continuation of application No. 17/069,638, filed on Oct. 13, 2020, granted, now 11,720,793.
Claims priority of provisional application 62/914,684, filed on Oct. 14, 2019.
Prior Publication US 2023/0325669 A1, Oct. 12, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 3/08 (2023.01); G06F 18/22 (2023.01); G06F 18/23 (2023.01); G06N 5/02 (2023.01); G06V 10/74 (2022.01); G06V 10/762 (2022.01); G06V 20/40 (2022.01)
CPC G06N 3/08 (2013.01) [G06F 18/22 (2023.01); G06F 18/23 (2023.01); G06N 5/02 (2013.01); G06V 10/761 (2022.01); G06V 10/762 (2022.01); G06V 20/41 (2022.01); G06V 20/47 (2022.01); G06V 20/44 (2022.01)]
18 Claims
1. A computer-implemented method to generate augmented reality imagery, the method comprising:
obtaining, by a computing system comprising one or more processors, a video;
processing, by the computing system, the video with a machine-learned anchor model to determine a plurality of anchors associated with the video and a plurality of respective anchor text datasets, wherein each anchor in the plurality of anchors for the video begins at a respective playback time specified by a respective time index value of a time in the video, wherein each respective anchor text dataset of the plurality of respective anchor text datasets is predicted to be descriptive of subject matter in the video beginning at the time index value;
wherein the machine-learned anchor model was trained on a training dataset comprising: one or more training videos, a set of training anchors associated with the one or more training videos, text generated based on training audio associated with the one or more training videos, and a set of entity labels, wherein each training anchor of the set of training anchors is associated with a specific playback time of the one or more training videos, wherein the text is generated via automatic speech recognition, and wherein at least a subset of the set of entity labels are associated with the text generated based on the training audio, wherein the set of entity labels were generated based on:
generating one or more transcripts for the one or more training videos;
processing the one or more transcripts of the one or more training videos to generate a list of entities associated with the one or more training videos;
determining a hypernym list based on determining a hypernym for each entity of the list of entities;
generating entity clusters by clustering entities of the list of entities based on similarities of hypernyms from the hypernym list;
processing the entity clusters to filter the list of entities to generate the set of entity labels for training; and
storing, by the computing system, the plurality of anchors with the plurality of respective anchor text datasets in an index.
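
For illustration only, the entity-label steps recited above (determining hypernyms, clustering entities by hypernym, filtering by cluster) and the final storing step can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the `HYPERNYMS` lookup table, the exact-match stand-in for hypernym "similarity", the `min_cluster_size` filter rule, and the `Anchor` and `AnchorIndex` types are all hypothetical and do not appear in the claim.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical hypernym lookup. The claim does not say how a hypernym is
# determined for each entity; a lexical database could fill this role.
HYPERNYMS = {
    "flour": "ingredient",
    "butter": "ingredient",
    "sugar": "ingredient",
    "whisk": "utensil",
    "tuesday": "weekday",
}


def cluster_by_hypernym(entities):
    """Group entities whose hypernyms match exactly.

    Exact matching is an assumed stand-in for the claim's
    "similarities of hypernyms"; entities without a known
    hypernym are skipped.
    """
    clusters = defaultdict(list)
    for entity in entities:
        hypernym = HYPERNYMS.get(entity)
        if hypernym is not None:
            clusters[hypernym].append(entity)
    return dict(clusters)


def filter_entities(clusters, min_cluster_size=2):
    """Keep entities from clusters with enough members (assumed filter rule)."""
    kept = []
    for members in clusters.values():
        if len(members) >= min_cluster_size:
            kept.extend(members)
    return sorted(kept)


@dataclass(frozen=True)
class Anchor:
    """One anchor: a playback time in seconds plus its descriptive text."""
    time_index_s: float
    text: str


class AnchorIndex:
    """In-memory index mapping a video identifier to its anchors."""

    def __init__(self):
        self._by_video = defaultdict(list)

    def store(self, video_id, anchors):
        # Append the new anchors, then keep them ordered by playback time.
        self._by_video[video_id].extend(anchors)
        self._by_video[video_id].sort(key=lambda a: a.time_index_s)

    def lookup(self, video_id):
        # Return a copy so callers cannot mutate the index; unknown ids give [].
        return list(self._by_video.get(video_id, []))
```

In this sketch, entities extracted from a cooking transcript would cluster under "ingredient", while singleton clusters such as "weekday" would be filtered out as unlikely labels. Grouping on exact hypernym matches keeps the example short; a similarity measure over hypernyms could replace it without changing the surrounding steps.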