| CPC G06V 40/20 (2022.01) [G06T 5/20 (2013.01); G06V 10/62 (2022.01); G06V 10/764 (2022.01); G06V 10/771 (2022.01); G06V 20/70 (2022.01)] | 9 Claims |

|
1. A spatiotemporal enhancement network based video action recognition method, comprising:
S1, equally partitioning a video into T time periods, and randomly sampling one frame from each time period, to obtain an input sequence of T image frames;
S2, preprocessing the image sequence acquired in S1;
S3, taking a tensor obtained in S2 as an input, inputting the tensor into a spatiotemporal enhancement network, and extracting spatial and temporal features by a model; and
S4, activating and normalizing the spatial and temporal features obtained in S3 by softmax, averaging the normalized spatial and temporal features along a time dimension, finally obtaining classification scores of behaviors in videos through transformation, and then taking a label corresponding to a highest score as a classification result,
wherein the step S3 comprises:
S3-1, taking MobileNet V2 with 17 bottlenecks as a basic network, and embedding a designed spatiotemporal enhancement module in the 3rd, 5th, 6th, 8th, 9th, 10th, 12th, 13th, 15th, and 16th bottlenecks of the basic network to obtain the spatiotemporal enhancement network;
S3-2, to ensure a long-term modeling capability of the spatiotemporal enhancement network, cascading a 1D convolutional kernel with a size of 3 before the spatiotemporal enhancement module; and
S3-3, implementing the spatiotemporal enhancement module in the form of a residual block, wherein a residual function of the spatiotemporal enhancement module is $x_{n+1} = x_n + A(x_n, W_n)$, $A(x_n, W_n)$ is a spatiotemporal enhancement part, and steps of the spatiotemporal enhancement part are: performing spatial averaging on input features respectively along a length dimension and a width dimension, then performing activation respectively by softmax, then performing matrix multiplication to obtain a spatial correlation map, and multiplying the map after time convolution by an original input of the spatiotemporal enhancement module to activate a part of the input features having rich motion information.
|
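The sampling of step S1 and the enhancement part of step S3-3 can be illustrated with a minimal Python sketch. It is not the claimed implementation: all function names are hypothetical, each frame is reduced to a single-channel feature map stored as nested lists, the "length dimension" is read as image height, and the time convolution uses a fixed 3-tap kernel with zero padding, whereas the claim does not state the kernel of that convolution and a trained network would learn its weights. When the frame count is not divisible by T, the sketch lets the periods differ by at most one frame, which is also an assumption.

```python
import math
import random


def sample_segment_indices(num_frames, T, rng=random):
    """S1 (sketch): split [0, num_frames) into T periods, draw one index from each."""
    if T <= 0 or num_frames < T:
        raise ValueError("need at least one frame per time period")
    bounds = [(i * num_frames) // T for i in range(T + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(T)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_correlation_map(feature):
    """Average along width and along height, softmax each, then take the outer product."""
    height, width = len(feature), len(feature[0])
    row_profile = [sum(row) / width for row in feature]            # length H
    col_profile = [sum(feature[h][w] for h in range(height)) / height
                   for w in range(width)]                          # length W
    a, b = softmax(row_profile), softmax(col_profile)
    # (H x 1) times (1 x W) matrix product gives an H x W map.
    return [[a[h] * b[w] for w in range(width)] for h in range(height)]


def temporal_conv(maps, kernel):
    """1D convolution over the time axis with zero padding (assumed, fixed weights)."""
    T, height, width = len(maps), len(maps[0]), len(maps[0][0])
    half = len(kernel) // 2
    out = []
    for t in range(T):
        frame = [[0.0] * width for _ in range(height)]
        for k, weight in enumerate(kernel):
            s = t + k - half
            if 0 <= s < T:
                for h in range(height):
                    for w in range(width):
                        frame[h][w] += weight * maps[s][h][w]
        out.append(frame)
    return out


def enhance(clip, kernel=(0.25, 0.5, 0.25)):
    """Residual form x_{n+1} = x_n + A(x_n): A multiplies the input by the filtered map."""
    maps = temporal_conv([spatial_correlation_map(f) for f in clip], kernel)
    return [[[x + m * x for x, m in zip(x_row, m_row)]
             for x_row, m_row in zip(frame, fmap)]
            for frame, fmap in zip(clip, maps)]
```

For example, `sample_segment_indices(16, 4)` returns four indices, one from each of the ranges 0-3, 4-7, 8-11, and 12-15, and `enhance` returns a clip of the same shape as its input. Because the map is an outer product of two softmax vectors, each per-frame map sums to 1 before the time convolution.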