US 12,423,975 B2
Action localization method, device, electronic equipment, and computer-readable storage medium
Shizhuo Liu, Beijing (CN); Jingjun Jiao, Beijing (CN); Xiaobing Wang, Beijing (CN); Lanlan Zhang, Beijing (CN); Wei Li, Beijing (CN); Haifeng Zhang, Beijing (CN); Zhezhu Jin, Beijing (CN); and Wei Wen, Beijing (CN)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Filed on Jun. 17, 2022, as Appl. No. 17/843,603.
Application 17/843,603 is a continuation of application No. PCT/KR2022/000532, filed on Jan. 12, 2022.
Claims priority of application No. 202110038254.2 (CN), filed on Jan. 12, 2021; and application No. 202110845122.0 (CN), filed on Jul. 26, 2021.
Prior Publication US 2022/0327834 A1, Oct. 13, 2022
Int. Cl. G06V 20/40 (2022.01); G06T 7/246 (2017.01); G06V 40/20 (2022.01)
CPC G06V 20/46 (2022.01) [G06T 7/248 (2017.01); G06V 40/20 (2022.01); G06T 2207/30196 (2013.01)] 19 Claims
OG exemplary drawing
 
1. An action localization method comprising:
identifying at least one target video segment including a target object in a video;
acquiring at least one image frame from the at least one target video segment;
acquiring, by an action recognition network, a first action recognition result of the at least one image frame, the first action recognition result corresponding to a first action recognition score indicating probabilities of at least one preset action type occurring in the at least one image frame;
acquiring, by the action recognition network, a second action recognition result of the at least one target video segment by aggregating the first action recognition result of the at least one image frame, the second action recognition result corresponding to a second action recognition score indicating probabilities of at least one preset action type occurring in the at least one target video segment;
acquiring an action localization result of the video based on the first action recognition result and the second action recognition result, the action localization result indicating whether a preset action type occurs in the at least one target video segment; and
outputting the action localization result,
wherein the acquiring of the first action recognition result comprises processing an input feature map that represents the at least one image frame through a temporal residual neural network that comprises:
an attention sub-network configured to apply maximum pooling and average pooling to the input feature map to obtain a first output feature map including attention scores for each of the at least one image frame; and
a neighbor sub-network configured to extract local temporal information from adjacent frames to obtain a second output feature map including the local temporal information.
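The claimed pipeline can be illustrated with a minimal sketch. All names, the sigmoid attention form, and the simple mean aggregation below are illustrative assumptions, not the patented implementation: per-frame scores (the first action recognition result) are aggregated into per-segment scores (the second result), and a toy attention step combines max pooling and average pooling over each frame's features, as the attention sub-network does.

```python
import math

def frame_attention(frame_features):
    """Toy attention sub-network: for each frame's feature vector,
    combine max-pooled and average-pooled channel statistics and
    squash with a sigmoid to get a per-frame attention score.
    (Hypothetical formulation for illustration only.)"""
    scores = []
    for feat in frame_features:
        pooled = max(feat) + sum(feat) / len(feat)  # max pool + avg pool
        scores.append(1.0 / (1.0 + math.exp(-pooled)))
    return scores

def aggregate_segment_scores(frame_scores, threshold=0.5):
    """Aggregate per-frame action scores into per-segment scores by
    averaging, then report which preset action types occur in the
    segment (score >= threshold). frame_scores is a list of dicts
    mapping action type -> probability for one image frame."""
    if not frame_scores:
        return {}, []
    actions = frame_scores[0].keys()
    segment_scores = {
        a: sum(f[a] for f in frame_scores) / len(frame_scores)
        for a in actions
    }
    localized = [a for a, s in segment_scores.items() if s >= threshold]
    return segment_scores, localized

# Example: three frames scored against two preset action types.
frames = [
    {"wave": 0.9, "run": 0.2},
    {"wave": 0.7, "run": 0.4},
    {"wave": 0.8, "run": 0.1},
]
segment, occurring = aggregate_segment_scores(frames)
# segment["wave"] averages to 0.8, so "wave" is localized in the segment.
```

Thresholding the aggregated segment score, rather than any single frame score, reflects the claim's use of both the first (frame-level) and second (segment-level) results to produce the localization result.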