US 12,254,693 B2
Action classification in video clips using attention-based neural networks
Joao Carreira, St Albans (GB); Carl Doersch, London (GB); and Andrew Zisserman, London (GB)
Assigned to DeepMind Technologies Limited, London (GB)
Filed by DeepMind Technologies Limited, London (GB)
Filed on Oct. 2, 2023, as Appl. No. 18/375,941.
Application 18/375,941 is a continuation of application No. 17/295,329, filed as application No. PCT/EP2019/081877 on Nov. 20, 2019, granted, now Patent No. 11,776,269.
Claims priority of provisional application 62/770,096, filed on Nov. 20, 2018.
Prior Publication US 2024/0029436 A1, Jan. 25, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06V 20/40 (2022.01); G06N 3/045 (2023.01); G06V 10/25 (2022.01)
CPC G06V 20/46 (2022.01) [G06N 3/045 (2023.01); G06V 10/25 (2022.01); G06V 20/41 (2022.01)] 18 Claims
 
1. A method comprising:
obtaining a feature representation of a video clip comprising a key video frame from a video and one or more context video frames from the video;
obtaining data specifying a plurality of candidate agent bounding boxes in the key video frame, wherein each candidate agent bounding box is an initial estimate of a portion of the key video frame that depicts an agent; and
for each candidate agent bounding box:
processing the feature representation through an action transformer neural network, wherein the action transformer neural network comprises:
a stack of action transformer layers configured to process the feature representation to generate final query features for the candidate agent bounding box, wherein each action transformer layer is configured to:
for each of one or more attention units:
 receive input query features for the action transformer layer,
 generate, from the feature representation, key features,
 generate, from the feature representation, value features,
 apply an attention mechanism to the input query features, the key features, and the value features to generate initial updated query features; and
 generate output query features from the initial updated query features, wherein:
the input query features for the first action transformer layer in the stack are features corresponding to the candidate agent bounding box in the feature representation,
the input query features for each action transformer layer in the stack other than the first action transformer layer are generated based on the output query features for each attention unit in the preceding action transformer layer in the stack, and
the final query features are generated based on the output query features for each attention unit in the last action transformer layer in the stack; and
one or more regression output layers configured to process a final feature vector composed of the final query features for the candidate agent bounding box to generate data defining a final bounding box that is a refined estimate of the portion of the key video frame that depicts the agent.
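
The claim describes, in prose, a transformer-style attention architecture: features pooled from a candidate agent box act as the query, the clip's feature representation supplies the keys and values, and attention produces updated query features. Below is a minimal sketch of one such attention unit, assuming PyTorch; the class name, the single-head design, the softmax scaling, and all dimensions are illustrative assumptions rather than the patented implementation.

# Minimal sketch of one attention unit from the claim, assuming PyTorch.
# Names, dimensions, and the single-head design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionUnit(nn.Module):
    """Query from the candidate box; keys/values from the clip features."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects the input query features
        self.k_proj = nn.Linear(dim, dim)  # key features from the feature representation
        self.v_proj = nn.Linear(dim, dim)  # value features from the feature representation
        self.out = nn.Linear(dim, dim)     # maps updated query to output query features

    def forward(self, query: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # query: (dim,) input query features; features: (N, dim) clip feature vectors
        q = self.q_proj(query)                               # (dim,)
        k = self.k_proj(features)                            # (N, dim)
        v = self.v_proj(features)                            # (N, dim)
        attn = F.softmax(k @ q / q.shape[-1] ** 0.5, dim=0)  # (N,) attention weights
        updated = attn @ v                                   # initial updated query features
        return self.out(updated)                             # output query features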
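
A companion sketch, under the same assumptions and reusing the hypothetical AttentionUnit above, covers the stack and the regression head: each layer's input query is generated from the output query features of every attention unit in the preceding layer (here, by concatenation and a linear mix), and a regression output layer maps the final feature vector to a refined box.

# Hypothetical stack of action transformer layers plus a box-regression head.
class ActionTransformer(nn.Module):
    def __init__(self, dim: int, num_layers: int, num_units: int):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleList(AttentionUnit(dim) for _ in range(num_units))
            for _ in range(num_layers)
        )
        self.merge = nn.Linear(dim * num_units, dim)  # combines per-unit outputs
        self.box_head = nn.Linear(dim, 4)             # regression output layer: (x, y, w, h)

    def forward(self, box_query: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        # box_query: features corresponding to the candidate agent bounding box
        query = box_query
        for units in self.layers:
            # the next layer's input is generated from all unit outputs of this layer
            outs = [unit(query, features) for unit in units]
            query = self.merge(torch.cat(outs, dim=-1))
        # final feature vector composed of the final query features -> refined box
        return self.box_head(query)

The concatenate-and-mix step is only one way to satisfy "generated based on the output query features for each attention unit"; averaging or gated combinations would fit the same claim language.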