US 11,837,025 B2
Method and apparatus for action recognition
Brais Martinez, Staines (GB); Tao Xiang, Staines (GB); Victor Augusto Escorcia, Staines (GB); Juan Perez-Rua, Staines (GB); Xiatian Zhu, Staines (GB); and Antoine Toisoul, Staines (GB)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Appl. No. 17/421,073
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
PCT Filed Feb. 16, 2021, PCT No. PCT/KR2021/001953
§ 371(c)(1), (2) Date Jul. 7, 2021,
PCT Pub. No. WO2021/177628, PCT Pub. Date Sep. 10, 2021.
Claims priority of application No. 2003088 (GB), filed on Mar. 4, 2020; and application No. 20206371 (EP), filed on Nov. 9, 2020.
Prior Publication US 2023/0145150 A1, May 11, 2023
Int. Cl. G06V 40/20 (2022.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G06V 10/77 (2022.01)
CPC G06V 40/20 (2022.01) [G06V 10/764 (2022.01); G06V 10/77 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01)]
15 Claims
 
1. A method of performing video action recognition, the method comprising:
receiving a video comprising a plurality of frames;
generating a multi-dimensional feature tensor representing the received video, the multi-dimensional feature tensor having a plurality of channels; and
performing action recognition using a machine learning (ML) model comprising a plurality of temporal cross-resolution (TCR) blocks, by:
splitting the multi-dimensional feature tensor into:
a first feature tensor having a first set of channels, and
a second feature tensor having a second set of channels;
applying at least one temporal pooling layer to the second feature tensor to temporally downsample the second feature tensor;
processing, in parallel, using each TCR block among the plurality of TCR blocks:
the first feature tensor, and
the temporally downsampled second feature tensor; and
outputting a prediction of an action within the received video.
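
The data flow recited in claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the (channels, time, height, width) tensor layout, the even channel split, average pooling with stride 2, the ReLU stand-in for a TCR block, and the random linear classifier are all assumptions chosen for illustration, and every function name is hypothetical. The claim does not specify the internals of a TCR block, so the placeholder below simply processes the two streams side by side.

```python
import numpy as np


def split_channels(features, first_channels):
    """Split a (C, T, H, W) feature tensor into two tensors along the channel axis."""
    return features[:first_channels], features[first_channels:]


def temporal_avg_pool(features, stride=2):
    """Temporally downsample by averaging non-overlapping windows along the time axis."""
    c, t, h, w = features.shape
    t_out = t // stride
    trimmed = features[:, : t_out * stride]  # drop trailing frames that do not fill a window
    return trimmed.reshape(c, t_out, stride, h, w).mean(axis=2)


def tcr_block(full_res, low_res):
    """Hypothetical stand-in for a TCR block: handles both streams in the same step.

    A ReLU replaces the learned layers, which the claim does not describe.
    """
    return np.maximum(full_res, 0.0), np.maximum(low_res, 0.0)


def recognize_action(features, num_blocks=2, num_classes=5, seed=0):
    """Run the claimed steps on a feature tensor and return a predicted class index."""
    rng = np.random.default_rng(seed)
    num_channels = features.shape[0]

    # Split into a first and a second set of channels (even split is an assumption).
    first, second = split_channels(features, num_channels // 2)

    # Temporally downsample only the second feature tensor.
    second = temporal_avg_pool(second, stride=2)

    # Each block processes the full-rate and the downsampled stream together.
    for _ in range(num_blocks):
        first, second = tcr_block(first, second)

    # Global average pooling per channel, then an untrained linear classifier (assumption).
    pooled = np.concatenate([first.mean(axis=(1, 2, 3)), second.mean(axis=(1, 2, 3))])
    weights = rng.standard_normal((num_classes, pooled.shape[0]))
    return int(np.argmax(weights @ pooled))


if __name__ == "__main__":
    demo = np.random.default_rng(1).standard_normal((8, 4, 2, 2))
    print("predicted class index:", recognize_action(demo))
```

Because the classifier weights are random, the predicted index carries no meaning; the sketch only shows the order of operations and how the tensor shapes change after the split and the temporal pooling.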