US 12,488,620 B2
Spatiotemporal enhancement network based video action recognition method
He Huang, Suzhou (CN); and Jianuo Yu, Suzhou (CN)
Assigned to SOOCHOW UNIVERSITY, Suzhou (CN)
Appl. No. 18/032,158
Filed by SOOCHOW UNIVERSITY, Suzhou (CN)
PCT Filed Jul. 28, 2022, PCT No. PCT/CN2022/108524 § 371(c)(1), (2) Date Apr. 14, 2023, PCT Pub. No. WO2023/065759, PCT Pub. Date Apr. 27, 2023.
Claims priority of application No. 202111209904.1 (CN), filed on Oct. 18, 2021.
Prior Publication US 2024/0371203 A1, Nov. 7, 2024
Int. Cl. G06V 40/20 (2022.01); G06T 5/20 (2006.01); G06V 10/62 (2022.01); G06V 10/764 (2022.01); G06V 10/771 (2022.01); G06V 20/70 (2022.01)

CPC G06V 40/20 (2022.01) [G06T 5/20 (2013.01); G06V 10/62 (2022.01); G06V 10/764 (2022.01); G06V 10/771 (2022.01); G06V 20/70 (2022.01)]

9 Claims

1. A spatiotemporal enhancement network based video action recognition method, comprising:

S1, equally partitioning a video into T time periods, and randomly sampling one frame from each time period, to obtain an input sequence of T image frames;

S2, preprocessing the image sequence acquired in S1;

S3, taking a tensor obtained in S2 as an input, inputting the tensor into a spatiotemporal enhancement network, and extracting spatial and temporal features by a model; and

S4, activating and normalizing the spatial and temporal features obtained in S3 by softmax, averaging the normalized spatial and temporal features along a time dimension, finally obtaining classification scores of behaviors in videos through transformation, and then taking a label corresponding to a highest score as a classification result,

wherein the step S3 comprises:

S3-1, taking MobileNet V2 with 17 bottlenecks as a basic network, and embedding a designed spatiotemporal enhancement module in the 3rd, 5th, 6th, 8th, 9th, 10th, 12th, 13th, 15th, and 16th bottlenecks of the basic network to obtain the spatiotemporal enhancement network;

S3-2, to ensure a long-term modeling capability of the spatiotemporal enhancement network, cascading a 1D convolutional kernel with a size of 3 before the spatiotemporal enhancement module; and

S3-3, implementing the spatiotemporal enhancement module in the form of a residual block, wherein a residual function of the spatiotemporal enhancement module is x_{n+1} = x_n + A(x_n, W_n), A(x_n, W_n) is a spatiotemporal enhancement part, and steps of the spatiotemporal enhancement part are: performing spatial averaging on input features respectively along a length dimension and a width dimension, then performing activation respectively by softmax, then performing matrix multiplication to obtain a spatial correlation map, and multiplying the map after time convolution by an original input of the spatiotemporal enhancement module to activate a part of the input features having rich motion information.
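
For illustration only, steps S1, S3-3, and S4 of claim 1 might be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: all function names are hypothetical, the feature map is reduced to a single channel of a single frame, the "length" and "width" dimensions are assumed to be rows and columns, and the time convolution and learned weights W_n of S3-3 are omitted. The MobileNet V2 backbone of S3-1 and the 1D convolution of S3-2 are not modeled.

```python
import math
import random


def sample_frame_indices(num_frames, T, rng=random):
    """S1 sketch: split [0, num_frames) into T equal periods and
    draw one random frame index from each period."""
    if T <= 0 or num_frames < T:
        raise ValueError("need at least one frame per period")
    indices = []
    for t in range(T):
        start = (t * num_frames) // T
        end = ((t + 1) * num_frames) // T  # exclusive upper bound
        indices.append(rng.randrange(start, end))
    return indices


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_correlation_map(feature):
    """S3-3 sketch (assumption: one H x W map, single channel, single frame).
    Average along each spatial axis, softmax each profile, then take the
    (H x 1) by (1 x W) matrix product to get an H x W correlation map."""
    H, W = len(feature), len(feature[0])
    row_profile = [sum(row) / W for row in feature]  # averaged over width
    col_profile = [sum(feature[i][j] for i in range(H)) / H for j in range(W)]
    a = softmax(row_profile)
    b = softmax(col_profile)
    return [[a[i] * b[j] for j in range(W)] for i in range(H)]


def enhance(feature):
    """Residual form x_{n+1} = x_n + A(x_n), with A(x) = map * x.
    Assumption: the time convolution applied to the map is skipped here."""
    corr = spatial_correlation_map(feature)
    return [[x + c * x for x, c in zip(row, crow)]
            for row, crow in zip(feature, corr)]


def classify(frame_logits, labels):
    """S4 sketch: softmax per frame, average over the time dimension,
    then return the label with the highest averaged score."""
    T = len(frame_logits)
    probs = [softmax(logits) for logits in frame_logits]
    scores = [sum(p[k] for p in probs) / T for k in range(len(labels))]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], scores
```

As a usage sketch, `sample_frame_indices(80, 8)` returns eight indices, one from each consecutive block of ten frames, and `classify` expects one list of per-class outputs for each sampled frame. Because the two softmax profiles each sum to one, the entries of the correlation map also sum to one, so the map acts as a normalized spatial weighting of the input.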