US 12,260,609 B2
Behavior recognition method and system, electronic device and computer-readable storage medium
Zhenzhi Wu, Beijing (CN); and Yaolong Zhu, Beijing (CN)
Assigned to LYNXI TECHNOLOGIES CO., LTD., Beijing (CN)
Appl. No. 17/790,694
Filed by LYNXI TECHNOLOGIES CO., LTD., Beijing (CN)
PCT Filed Mar. 8, 2021, PCT No. PCT/CN2021/079530
§ 371(c)(1), (2) Date Jul. 1, 2022.
PCT Pub. No. WO2021/180030, PCT Pub. Date Sep. 16, 2021.
Claims priority of application No. 202010157538.9 (CN), filed on Mar. 9, 2020.
Prior Publication US 2023/0042187 A1, Feb. 9, 2023
Int. Cl. G06V 10/62 (2022.01); G06T 7/269 (2017.01); G06V 10/40 (2022.01); G06V 10/77 (2022.01); G06V 10/80 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01)
CPC G06V 10/62 (2022.01) [G06T 7/269 (2017.01); G06V 10/40 (2022.01); G06V 10/7715 (2022.01); G06V 10/80 (2022.01); G06V 10/82 (2022.01); G06V 20/46 (2022.01); G06V 20/49 (2022.01); G06T 2207/10016 (2013.01); G06T 2207/20084 (2013.01)] 18 Claims
OG exemplary drawing
 
1. A behavior recognition method, comprising:
dividing video data into a plurality of video clips, performing frame extraction processing on each video clip to obtain a plurality of frame images, and performing optical flow extraction on the plurality of frame images of each video clip obtained after the frame extraction to obtain optical flow images of each video clip;
respectively performing feature extraction on the frame images and the optical flow images of each video clip to obtain feature maps of the frame images of each video clip and feature maps of the optical flow images of each video clip;
respectively performing spatio-temporal convolution processing on the feature maps of the frame images of each video clip and the feature maps of the optical flow images of each video clip, and determining a spatial prediction result and a temporal prediction result of each video clip;
fusing the spatial prediction results of all the video clips to obtain a spatial fusion result, and fusing the temporal prediction results of all the video clips to obtain a temporal fusion result; and
performing two-stream fusion on the spatial fusion result and the temporal fusion result to obtain a behavior recognition result,
wherein respectively performing the spatio-temporal convolution processing on the feature maps of the frame images of each video clip and the feature maps of the optical flow images of each video clip, and determining the spatial prediction result and the temporal prediction result of each video clip comprises:
respectively performing time series feature extraction on the feature maps of the frame images of each video clip and the feature maps of the optical flow images of each video clip n times to obtain a first eigenvector, with n being a positive integer;
performing matrix transformation processing on the first eigenvector to obtain a second eigenvector;
performing time series full-connection processing on the second eigenvector to obtain a third eigenvector; and
determining the spatial prediction result and the temporal prediction result of each video clip according to the third eigenvector.
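The pipeline recited in claim 1 can be illustrated with a minimal numpy sketch. This is not the patented implementation: frame differencing stands in for true optical flow extraction, 2x2 mean pooling stands in for a learned feature extractor, and a random weight matrix stands in for the trained fully connected layer. All function names, array sizes, and the number of classes are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 5  # hypothetical number of behavior classes


def divide_into_clips(video, num_clips):
    """Divide a (T, H, W) video into num_clips roughly equal clips."""
    return np.array_split(video, num_clips, axis=0)


def extract_frames(clip, num_frames):
    """Frame extraction: uniformly sample num_frames frames from a clip."""
    idx = np.linspace(0, len(clip) - 1, num_frames).astype(int)
    return clip[idx]


def optical_flow(frames):
    """Crude stand-in for optical flow: temporal frame differences."""
    return np.diff(frames, axis=0)


def feature_maps(images):
    """Toy feature extractor: per-image 2x2 mean pooling."""
    t, h, w = images.shape
    return images.reshape(t, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def spatio_temporal_head(feats, weights):
    """Steps mirroring the claim's wherein-clause:
    time series feature extraction -> first eigenvector,
    matrix transformation -> second eigenvector,
    time series full connection -> third eigenvector (class scores)."""
    v1 = feats.mean(axis=0)   # temporal pooling over the clip
    v2 = v1.reshape(-1)       # matrix transformation to a flat vector
    v3 = v2 @ weights         # fully connected projection to class scores
    return v3


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


# Synthetic input: 64 grayscale frames of 32x32 pixels.
video = rng.random((64, 32, 32))
clips = divide_into_clips(video, num_clips=4)

# Random stand-ins for trained spatial- and temporal-stream weights.
w_spatial = rng.random((16 * 16, NUM_CLASSES))
w_temporal = rng.random((16 * 16, NUM_CLASSES))

spatial_preds, temporal_preds = [], []
for clip in clips:
    frames = extract_frames(clip, num_frames=8)
    flow = optical_flow(frames)
    spatial_preds.append(spatio_temporal_head(feature_maps(frames), w_spatial))
    temporal_preds.append(spatio_temporal_head(feature_maps(flow), w_temporal))

# Fuse per-clip results within each stream, then fuse the two streams.
spatial_fused = np.mean(spatial_preds, axis=0)
temporal_fused = np.mean(temporal_preds, axis=0)
scores = softmax((spatial_fused + temporal_fused) / 2)
prediction = int(scores.argmax())
```

Averaging is used here for both the per-stream and two-stream fusion steps purely for concreteness; the claim does not limit fusion to averaging, and a real system might use weighted or learned fusion instead.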