US 12,469,282 B2
Systems and methods for retrieving videos using natural language description
Ning Yan, Milpitas, CA (US)
Assigned to HUAWEI TECHNOLOGIES CO., LTD., Shenzhen (CN)
Filed by HUAWEI TECHNOLOGIES CO., LTD., Shenzhen (CN)
Filed on Nov. 29, 2022, as Appl. No. 18/071,523.
Application 18/071,523 is a continuation of application No. PCT/US2020/053802, filed on Oct. 1, 2020.
Claims priority of provisional application 63/032,571, filed on May 30, 2020.
Prior Publication US 2023/0086735 A1, Mar. 23, 2023
Int. Cl. G06K 9/00 (2022.01); G06F 16/735 (2019.01); G06F 16/738 (2019.01); G06F 16/783 (2019.01); G06F 40/279 (2020.01); G06V 10/86 (2022.01); G06V 20/40 (2022.01); G06V 10/426 (2022.01); G06V 30/196 (2022.01)
CPC G06V 20/41 (2022.01) [G06F 16/735 (2019.01); G06F 16/738 (2019.01); G06F 16/7837 (2019.01); G06F 40/279 (2020.01); G06V 10/86 (2022.01); G06V 10/426 (2022.01); G06V 20/46 (2022.01); G06V 20/48 (2022.01); G06V 30/1988 (2022.01)]
18 Claims
1. A computer-implemented method, comprising:
generating, by a data processing apparatus, a plurality of scene graphs for a plurality of videos, wherein generating the plurality of scene graphs includes:
extracting, by the data processing apparatus and from each video of the plurality of videos, a plurality of key frames, each key frame including a timestamp corresponding to an occurrence of the key frame within the video and a reference to the video including the key frame; and
generating, by the data processing apparatus and for each key frame in the plurality of key frames, a scene graph for the key frame, including:
identifying, by a machine-learned model, a plurality of objects in the key frame;
extracting, by the machine-learned model, a relationship feature defining a relationship between a first object and a second, different object of the plurality of objects in the key frame; and
generating, by the machine-learned model and from the first object, the second object, and the relationship feature, the scene graph for the key frame that includes a set of nodes and a set of edges that interconnect a subset of nodes in the set of nodes, wherein the first object is represented by a first node from the set of nodes, the second object is represented by a second node from the set of nodes, and the relationship feature is an edge connecting the first node to the second node;
receiving, by the data processing apparatus, a natural language query request for a video in the plurality of videos, wherein the natural language query request comprises a plurality of terms specifying two or more particular objects and a relationship between the two or more particular objects;
generating, by the data processing apparatus, a query graph for the natural language query request, the query graph representing objects and relationship features extracted from the natural language query request as nodes and edges between nodes;
matching the query graph to the scene graph of each key frame to identify, by the data processing apparatus and from the plurality of scene graphs, a set of scene graphs of the plurality of scene graphs matching the query graph; and
generating, by the data processing apparatus and from the identified set of scene graphs, a set of videos of the plurality of videos, each video including at least one scene graph of the set of scene graphs.
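
The retrieval flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not specify how objects and relationships are detected, how the query is parsed, or what counts as a match, so the sketch makes three assumptions: scene graphs are supplied as ready-made (first object, relationship, second object) triples in place of the machine-learned model's output; the query graph comes from a toy parser with a hypothetical relationship vocabulary (`RELATIONS`); and a scene graph matches when it contains every edge of the query graph with identical labels. All names (`Graph`, `SceneGraph`, `make_graph`, `parse_query`, `matches`, `retrieve_videos`) are illustrative.

```python
from dataclasses import dataclass

# One edge of a graph: (first object, relationship, second object).
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Graph:
    nodes: frozenset[str]     # objects
    edges: frozenset[Triple]  # relationship features connecting two nodes


@dataclass(frozen=True)
class SceneGraph:
    video_id: str     # reference to the video that contains the key frame
    timestamp: float  # position of the key frame within the video, in seconds
    graph: Graph


def make_graph(triples: list[Triple]) -> Graph:
    """Build a graph whose nodes are the objects named in the triples."""
    nodes = frozenset(obj for first, _, second in triples for obj in (first, second))
    return Graph(nodes=nodes, edges=frozenset(triples))


# Assumption: a tiny, hypothetical relationship vocabulary for the toy parser.
RELATIONS = ("riding", "holding", "next to")
ARTICLES = {"a", "an", "the"}


def parse_query(query: str) -> Graph:
    """Toy stand-in for query-graph generation: '<object> <relation> <object>'."""
    text = " ".join(w for w in query.lower().split() if w not in ARTICLES)
    for relation in RELATIONS:
        first, sep, second = text.partition(f" {relation} ")
        if sep and first and second:
            return make_graph([(first, relation, second)])
    raise ValueError(f"no known relationship found in query: {query!r}")


def matches(query_graph: Graph, scene: SceneGraph) -> bool:
    """Assumed criterion: every query node and edge appears in the scene graph."""
    return (query_graph.nodes <= scene.graph.nodes
            and query_graph.edges <= scene.graph.edges)


def retrieve_videos(query: str, scene_graphs: list[SceneGraph]) -> set[str]:
    """Return the videos that have at least one key frame matching the query."""
    query_graph = parse_query(query)
    return {sg.video_id for sg in scene_graphs if matches(query_graph, sg)}


# Hand-written scene graphs standing in for the model's per-key-frame output.
scene_graphs = [
    SceneGraph("v1", 3.0, make_graph([("man", "riding", "horse")])),
    SceneGraph("v1", 9.5, make_graph([("man", "holding", "cup")])),
    SceneGraph("v2", 1.0, make_graph([("dog", "next to", "car"),
                                      ("man", "riding", "horse")])),
    SceneGraph("v3", 4.0, make_graph([("horse", "riding", "man")])),
]

result = retrieve_videos("A man riding a horse", scene_graphs)
```

Because edges are directed triples, the key frame in `v3` (horse riding man) does not match the query even though it contains the same objects and relationship word. Exact-label containment is the simplest possible matching rule; a practical system would more likely tolerate synonyms or partial matches, which the claim neither requires nor excludes.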