| CPC G06V 20/41 (2022.01) [G06F 16/735 (2019.01); G06F 16/738 (2019.01); G06F 16/7837 (2019.01); G06F 40/279 (2020.01); G06V 10/86 (2022.01); G06V 10/426 (2022.01); G06V 20/46 (2022.01); G06V 20/48 (2022.01); G06V 30/1988 (2022.01)] | 18 Claims |

|
1. A computer-implemented method, comprising:
generating, by a data processing apparatus, a plurality of scene graphs for a plurality of videos, wherein generating the plurality of scene graphs includes:
extracting, by the data processing apparatus and from each video of the plurality of videos, a plurality of key frames, each key frame including a timestamp corresponding to an occurrence of the key frame within the video and a reference to the video including the key frame; and
generating, by the data processing apparatus and for each key frame in the plurality of key frames, a scene graph for the key frame, including:
identifying, by a machine-learned model, a plurality of objects in the key frame;
extracting, by the machine-learned model, a relationship feature defining a relationship between a first object and a second, different object of the plurality of objects in the key frame; and
generating, by the machine-learned model and from the first object, the second object, and the relationship feature, the scene graph for the key frame that includes a set of nodes and a set of edges that interconnect a subset of nodes in the set of nodes, wherein the first object is represented by a first node from the set of nodes, the second object is represented by a second node from the set of nodes, and the relationship feature is an edge connecting the first node to the second node;
receiving, by the data processing apparatus, a natural language query request for a video in the plurality of videos, wherein the natural language query request comprises a plurality of terms specifying two or more particular objects and a relationship between the two or more particular objects;
generating, by the data processing apparatus, a query graph for the natural language query request, the query graph representing objects and relationship features extracted from the natural language query request as nodes and edges between nodes;
matching the query graph to the scene graph of each key frame to identify, by the data processing apparatus and from the plurality of scene graphs, a set of scene graphs of the plurality of scene graphs matching the query graph; and
generating, by the data processing apparatus and from the identified set of scene graphs, a set of videos of the plurality of videos, each video including at least one scene graph of the set of scene graphs.
|
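
The retrieval steps recited in claim 1 (a scene graph per key frame, a query graph from the natural language request, graph matching, and collection of the matching videos) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the claim: the `SceneGraph` class, the `build_query_graph`, `matches`, and `search` functions, the toy clause parser, and the exact-label subgraph test are assumptions chosen for brevity. The claim does not specify how matching is performed, and the machine-learned model that detects objects and relationships is replaced by hand-written edges.

```python
from dataclasses import dataclass

# An edge is (subject, relation, object); the nodes are implied by the edges.
Edge = tuple[str, str, str]

STOP_WORDS = {"a", "an", "the"}


@dataclass(frozen=True)
class SceneGraph:
    """Scene graph of one key frame (illustrative structure, not from the claim)."""
    video_id: str      # reference to the video that contains the key frame
    timestamp: float   # position of the key frame within the video, in seconds
    edges: frozenset[Edge]


def build_query_graph(query: str, relations: set[str]) -> frozenset[Edge]:
    """Toy parser: reads 'subject relation object' clauses joined by ' and '.

    A real system would use an NLP pipeline; this stand-in only recognizes
    relation words listed in `relations`.
    """
    edges = set()
    for clause in query.lower().split(" and "):
        words = [w for w in clause.split() if w not in STOP_WORDS]
        for i, word in enumerate(words):
            if word in relations and 0 < i < len(words) - 1:
                edges.add((words[i - 1], word, words[i + 1]))
                break
    return frozenset(edges)


def matches(query_graph: frozenset[Edge], scene: SceneGraph) -> bool:
    """Assumed matching rule: every query edge appears verbatim in the scene graph."""
    return bool(query_graph) and query_graph <= scene.edges


def search(query: str, scene_graphs: list[SceneGraph], relations: set[str]) -> list[str]:
    """Return the sorted ids of videos with at least one matching key frame."""
    query_graph = build_query_graph(query, relations)
    return sorted({g.video_id for g in scene_graphs if matches(query_graph, g)})


# Hand-written scene graphs standing in for the model's output.
graphs = [
    SceneGraph("v1", 3.0, frozenset({("dog", "chasing", "ball"), ("man", "holding", "leash")})),
    SceneGraph("v1", 8.0, frozenset({("dog", "on", "grass")})),
    SceneGraph("v2", 10.0, frozenset({("cat", "on", "sofa")})),
]
print(search("a dog chasing a ball", graphs, {"chasing", "holding", "on"}))  # ['v1']
```

Treating each graph as a set of labeled triples reduces matching to a subset test, which keeps the sketch short. It would miss synonyms and paraphrases, so a practical system would likely need a softer similarity measure between query and scene graphs.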