CPC G06V 40/20 (2022.01) [G06V 10/764 (2022.01); G06V 10/77 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01)] | 15 Claims |
1. A method of performing video action recognition, the method comprising:
receiving a video comprising a plurality of frames;
generating a multi-dimensional feature tensor representing the received video, the multi-dimensional feature tensor having a plurality of channels; and
performing action recognition using a machine learning (ML) model comprising a plurality of temporal cross-resolution (TCR) blocks, by:
splitting the multi-dimensional feature tensor into:
a first feature tensor having a first set of channels, and
a second feature tensor having a second set of channels;
applying at least one temporal pooling layer to the second feature tensor to temporally downsample the second feature tensor;
processing, in parallel, using each TCR block among the plurality of TCR blocks:
the first feature tensor, and
the temporally downsampled second feature tensor; and
outputting a prediction of an action within the received video.
|
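The data flow recited in claim 1 (channel split, temporal pooling of the second tensor, then joint processing of both tensors by each TCR block) can be illustrated with a minimal sketch. This is not the patented implementation. It makes several assumptions: the feature tensor is reduced to a `[channel][time]` nested list with spatial dimensions omitted, the temporal pooling layer is average pooling with stride 2, and the names `split_channels`, `temporal_avg_pool`, `placeholder_tcr_block`, `recognize_action`, and the toy head are all hypothetical. The claim does not specify the internals of a TCR block, so the block here passes both streams through unchanged.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: [channel][time] layout; spatial dimensions omitted for brevity.
Tensor = List[List[float]]
Block = Callable[[Tensor, Tensor], Tuple[Tensor, Tensor]]


def split_channels(x: Tensor, first_count: int) -> Tuple[Tensor, Tensor]:
    """Split along the channel axis into a first and a second set of channels."""
    if not 0 < first_count < len(x):
        raise ValueError("first_count must leave at least one channel per side")
    return x[:first_count], x[first_count:]


def temporal_avg_pool(x: Tensor, stride: int = 2) -> Tensor:
    """Temporally downsample each channel by averaging non-overlapping windows."""
    pooled: Tensor = []
    for channel in x:
        usable = len(channel) - len(channel) % stride  # drop a ragged tail
        pooled.append(
            [sum(channel[t:t + stride]) / stride for t in range(0, usable, stride)]
        )
    return pooled


def placeholder_tcr_block(first: Tensor, second: Tensor) -> Tuple[Tensor, Tensor]:
    """Hypothetical stand-in for a TCR block: receives both streams, returns both."""
    return first, second


def recognize_action(
    x: Tensor,
    first_count: int,
    blocks: Sequence[Block],
    head: Callable[[Tensor, Tensor], str],
) -> str:
    """Split, pool the second stream, run every block on both streams, predict."""
    first, second = split_channels(x, first_count)
    second = temporal_avg_pool(second)
    for block in blocks:
        first, second = block(first, second)
    return head(first, second)


# Toy input: 4 channels, 8 time steps.
video_features = [[float(c * 10 + t) for t in range(8)] for c in range(4)]
first, second = split_channels(video_features, first_count=2)
pooled = temporal_avg_pool(second)
print(len(first), len(first[0]))    # full temporal resolution: 2 8
print(len(pooled), len(pooled[0]))  # half temporal resolution: 2 4

# Hypothetical head that always returns a fixed label, for illustration only.
label = recognize_action(
    video_features,
    first_count=2,
    blocks=[placeholder_tcr_block] * 3,
    head=lambda hi, lo: "hypothetical_action",
)
print(label)
```

In this sketch the first tensor keeps all 8 time steps while the second is reduced to 4, so each block sees the same video at two temporal resolutions. A real model would replace the placeholder block and head with learned layers.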