US 11,721,130 B2
Weakly supervised video activity detection method and system based on iterative learning
Yan Song, Jiangsu (CN); Rong Zou, Jiangsu (CN); and Xiangbo Shu, Jiangsu (CN)
Assigned to NANJING UNIVERSITY OF SCIENCE AND TECHNOLOGY, Nanjing (CN)
Appl. No. 17/425,653
Filed by Nanjing University of Science and Technology, Jiangsu (CN)
PCT Filed Sep. 16, 2020, PCT No. PCT/CN2020/115542
§ 371(c)(1), (2) Date Jul. 23, 2021,
PCT Pub. No. WO2022/007193, PCT Pub. Date Jan. 13, 2022.
Claims priority of application No. 202010644474.5 (CN), filed on Jul. 7, 2020.
Prior Publication US 2022/0189209 A1, Jun. 16, 2022
Int. Cl. G06V 40/20 (2022.01); G06V 10/82 (2022.01); G06V 10/40 (2022.01)
CPC G06V 40/23 (2022.01) [G06V 10/40 (2022.01); G06V 10/82 (2022.01)] 10 Claims
OG exemplary drawing
 
1. A weakly supervised video activity detection method based on iterative learning, comprising:
extracting spatial-temporal features of a video that contains actions, the spatial-temporal features being divided into spatial-temporal features in a training set and spatial-temporal features in a test set;
constructing a neural network model group, the neural network model group containing at least two neural network models, an input of each neural network model being the spatial-temporal features in the training set, and an output of each neural network model being a class activation sequence, a pseudo temporal label, and a video feature computed by that neural network model from the spatial-temporal features in the training set;
training a first neural network model according to the class label of the video, the class activation sequence output by the first neural network model, and the video feature output by the first neural network model, the first neural network model being the initial model in the neural network model group;
training the next neural network model according to the class label of the video, the pseudo temporal label output by the current neural network model, the class activation sequence output by the next neural network model, and the video feature output by the next neural network model;
inputting the spatial-temporal features in the test set to the various neural network models, and respectively performing action detection on each corresponding test video in the test set according to the class activation sequences output by the various neural network models, to obtain the detection accuracy of the various neural network models; and
performing action detection on the test video using the neural network model having the highest detection accuracy.
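The iterative scheme in the claim can be sketched as follows. This is a minimal NumPy toy, not the patented system: all names (`SegmentClassifier`, `pseudo_temporal_labels`, the attention pooling, the approximate gradient, and the proxy "detection accuracy" used for model selection) are illustrative assumptions. A real implementation would extract deep spatio-temporal features (e.g. from a 3D CNN), use a proper multiple-instance loss with exact gradients, and score detections against ground-truth temporal annotations.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SegmentClassifier:
    """Toy per-segment linear model producing the three outputs the claim
    names: a class activation sequence, an attention-pooled video feature,
    and (via pseudo_temporal_labels) a pseudo temporal label."""
    def __init__(self, dim, num_classes, lr=0.1):
        self.W = rng.normal(scale=0.01, size=(dim, num_classes))
        self.lr = lr

    def forward(self, feats):
        cas = feats @ self.W                 # (T, C) class activation sequence
        attn = softmax(cas.max(axis=1))      # attention over the T segments
        video_feat = attn @ feats            # attention-pooled video feature
        return cas, attn, video_feat

    def train_step(self, feats, label, pseudo=None):
        cas, attn, _ = self.forward(feats)
        video_score = softmax(attn @ cas)    # (C,) video-level prediction
        onehot = np.eye(cas.shape[1])[label]
        # approximate gradient of a video-level (weakly supervised) loss,
        # treating the attention weights as constants for simplicity
        grad = feats.T @ np.outer(attn, video_score - onehot)
        if pseudo is not None:               # per-segment pseudo supervision
            seg_target = np.outer(pseudo, onehot)
            grad += feats.T @ (sigmoid(cas) - seg_target) / len(feats)
        self.W -= self.lr * grad

def pseudo_temporal_labels(cas, label, thresh=0.5):
    """Threshold the normalized activations of the video's class."""
    act = cas[:, label]
    act = (act - act.min()) / (np.ptp(act) + 1e-8)
    return (act > thresh).astype(float)      # (T,) binary pseudo labels

# --- iterative learning over a model group -------------------------------
T, D, C = 20, 8, 3                           # segments, feature dim, classes
train_feats, train_label = rng.normal(size=(T, D)), 1
models, pseudo = [SegmentClassifier(D, C) for _ in range(2)], None
for model in models:                         # model k+1 sees model k's labels
    for _ in range(50):
        model.train_step(train_feats, train_label, pseudo)
    cas, _, _ = model.forward(train_feats)
    pseudo = pseudo_temporal_labels(cas, train_label)

# --- model selection (stand-in for test-set detection accuracy) ----------
test_feats = rng.normal(size=(T, D))
scores = []
for model in models:
    cas, attn, _ = model.forward(test_feats)
    scores.append((attn @ cas)[train_label])  # proxy score, not real accuracy
best = models[int(np.argmax(scores))]
```

The key design point the claim protects is the hand-off between models: the first model trains from the video-level class label alone, while each subsequent model additionally receives the previous model's pseudo temporal labels as segment-level supervision, so localization sharpens across iterations.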