US 12,367,712 B2
Action recognition method, apparatus and device, storage medium and computer program product
Boyuan Jiang, Guangdong (CN); Donghao Luo, Guangdong (CN); Mingyu Wu, Guangdong (CN); Yabiao Wang, Guangdong (CN); Chengjie Wang, Guangdong (CN); Xiaoming Huang, Guangdong (CN); Jilin Li, Guangdong (CN); Feiyue Huang, Guangdong (CN); and Yongjian Wu, Guangdong (CN)
Assigned to Tencent Technology (Shenzhen) Company Limited, (CN)
Filed by Tencent Technology (Shenzhen) Company Limited, Guangdong (CN)
Filed on Oct. 31, 2022, as Appl. No. 17/977,415.
Application 17/977,415 is a continuation of application No. PCT/CN2022/073411, filed on Jan. 24, 2022.
Claims priority of application No. 202110134629.5 (CN), filed on Jan. 29, 2021.
Prior Publication US 2023/0067934 A1, Mar. 2, 2023
Int. Cl. G06V 40/20 (2022.01); G06V 10/56 (2022.01); G06V 10/74 (2022.01); G06V 10/75 (2022.01); G06V 10/80 (2022.01); G06V 20/40 (2022.01)

CPC G06V 40/20 (2022.01) [G06V 10/56 (2022.01); G06V 10/757 (2022.01); G06V 10/761 (2022.01); G06V 10/806 (2022.01); G06V 20/46 (2022.01); G06V 20/48 (2022.01)]

19 Claims

1. An action recognition method performed by a computer device, comprising:

obtaining multiple video frames in a target video;

performing feature extraction on the multiple video frames respectively according to multiple dimensions to obtain multiple multi-channel feature patterns, each video frame corresponding to one multi-channel feature pattern, and each channel representing one dimension;

determining an attention weight of each multi-channel feature pattern based on a similarity between every two multi-channel feature patterns in the multiple multi-channel feature patterns comprising the each multi-channel feature pattern and another multi-channel feature pattern, the attention weight being used for representing a degree of correlation between a corresponding multi-channel feature pattern and an action performed by an object in the target video, the similarity between a multi-channel feature pattern pair being used for representing a magnitude of a motion performed by the object in the multiple video frames corresponding to the multi-channel feature pattern pair; and

determining a type of the action based on the multiple multi-channel feature patterns and the determined multiple attention weights.
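
The four steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the patented implementation: the claim does not specify the feature extractor, the similarity measure, the mapping from similarity to attention weight, or the classifier. Here the feature extractor is assumed to be a per-pixel linear projection, the similarity is assumed to be cosine similarity, the attention weight is assumed to be a softmax over one minus the mean similarity to the other frames (so frames that differ more from the rest, suggesting larger motion, receive more weight), and the classifier is assumed to be a linear layer over pooled features. All function and parameter names (`extract_feature_patterns`, `proj`, `classifier`, and so on) are hypothetical.

```python
import numpy as np


def extract_feature_patterns(frames, proj):
    """Assumed extractor: per-pixel linear projection of RGB to C channels.

    frames: (T, H, W, 3), proj: (3, C) -> patterns: (T, C, H, W),
    i.e. one multi-channel feature pattern per video frame.
    """
    return np.einsum("thwk,kc->tchw", frames, proj)


def pairwise_similarity(patterns):
    """Assumed similarity: cosine similarity between every two patterns."""
    flat = patterns.reshape(patterns.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-12)  # guard against zero norm
    return unit @ unit.T  # (T, T)


def attention_weights(sim):
    """Assumed mapping: softmax over (1 - mean similarity to other frames)."""
    t = sim.shape[0]
    mean_to_others = (sim.sum(axis=1) - np.diag(sim)) / (t - 1)
    scores = 1.0 - mean_to_others
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()  # (T,), sums to 1


def classify_action(patterns, weights, classifier):
    """Assumed classifier: weighted fusion, global average pool, linear layer."""
    fused = np.tensordot(weights, patterns, axes=1)  # (C, H, W)
    pooled = fused.mean(axis=(1, 2))  # (C,)
    logits = pooled @ classifier  # (num_classes,)
    return int(np.argmax(logits))


# Usage with random placeholder data (4 frames, 5 channels, 3 action types).
rng = np.random.default_rng(0)
frames = rng.random((4, 8, 8, 3))
proj = rng.standard_normal((3, 5))
classifier = rng.standard_normal((5, 3))

patterns = extract_feature_patterns(frames, proj)
sim = pairwise_similarity(patterns)
weights = attention_weights(sim)
action_type = classify_action(patterns, weights, classifier)
print(patterns.shape, sim.shape, weights, action_type)
```

In this sketch the attention weights are normalized across frames so that the fused pattern is a convex combination of the per-frame patterns; the claim itself only requires that the action type be determined from the feature patterns and the attention weights.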