CPC G06V 20/47 (2022.01) [G06N 3/08 (2013.01); G06V 10/42 (2022.01); G06V 20/42 (2022.01)] | 20 Claims |
1. A computer-implemented method for identifying an action in a video, the method comprising:
inputting a video into a feature extraction module comprising a plurality of trained neural network action recognition models that each receive as input the video and output a series of feature representations in which each feature representation corresponds to a video sequence of one or more video frames of the video;
for each feature representation from the plurality of trained neural network action recognition models corresponding to a same video sequence, combining the feature representations from the plurality of trained neural network action recognition models to form a combined semantic feature for that video sequence;
for a set of one or more combined semantic features, obtaining output classification probabilities of actions from a transformer-based temporal detection model, which comprises:
a summation module, which receives the combined semantic features and corresponding positional encodings as inputs and combines them to form an input for a first transformer encoding layer of a plurality of transformer encoding layers;
the plurality of transformer encoding layers arrayed sequentially, in which the output of a prior layer is input into a next layer and a first transformer encoding layer receives the output from the summation module; and
an activation layer that receives an output from a last transformer encoding layer of the plurality of transformer encoding layers and outputs classification probabilities of actions for the set of one or more combined semantic features;
for the set of one or more combined semantic features, assigning the action label that has a highest classification probability and is above a threshold value; and
outputting a video time of the action using a time value or time values correlated to video sequences that correspond to the set of one or more combined semantic features.
|
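The data flow recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and it rests on several assumptions that the claim leaves open: features are combined by concatenation, positional encodings are sinusoidal, the activation layer is a softmax, and a video time is derived from a fixed number of frames per sequence and a frame rate. The transformer encoding layers are omitted; the `logits` list stands in for the output of the last encoding layer. All function names, action labels, and numeric values are hypothetical.

```python
import math

def combine_features(per_model_features):
    """Concatenate, per video sequence, the feature vectors from every model.

    per_model_features[m][t] is the vector from model m for sequence t.
    Concatenation is an assumption; the claim only recites "combining".
    """
    num_sequences = len(per_model_features[0])
    return [
        [x for model in per_model_features for x in model[t]]
        for t in range(num_sequences)
    ]

def positional_encoding(position, dim):
    """Sinusoidal encoding (assumed form; the claim does not specify one)."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def summation_module(combined):
    """Add each combined feature to its positional encoding, element-wise."""
    return [
        [f + p for f, p in zip(feat, positional_encoding(t, len(feat)))]
        for t, feat in enumerate(combined)
    ]

def softmax(logits):
    """Activation layer (assumed softmax): logits to probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def assign_label(probs, labels, threshold):
    """Return the most probable label, or None if it is not above threshold."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] > threshold else None

def video_time(sequence_indices, frames_per_sequence, fps):
    """Map sequence indices to a (start, end) time span in seconds."""
    start = min(sequence_indices) * frames_per_sequence / fps
    end = (max(sequence_indices) + 1) * frames_per_sequence / fps
    return start, end

# Hypothetical example: two models, two video sequences.
features_a = [[0.1, 0.2], [0.3, 0.4]]   # model A, 2-dim features
features_b = [[1.0], [2.0]]             # model B, 1-dim features
combined = combine_features([features_a, features_b])
encoder_input = summation_module(combined)  # would feed the first encoding layer

logits = [2.0, 0.5, 0.1]                # stand-in for the last encoder output
probs = softmax(logits)
label = assign_label(probs, ["goal", "foul", "corner"], threshold=0.5)
span = video_time([0, 1], frames_per_sequence=16, fps=8.0)
```

In this sketch, `assign_label` returns `None` when no probability exceeds the threshold, which is one way to represent a set of combined semantic features that contains no recognized action.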