US 11,837,025 B2
Method and apparatus for action recognition
Brais Martinez, Staines (GB); Tao Xiang, Staines (GB); Victor Augusto Escorcia, Staines (GB); Juan Perez-Rua, Staines (GB); Xiatian Zhu, Staines (GB); and Antoine Toisoul, Staines (GB)
Assigned to SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
Appl. No. 17/421,073
Filed by SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)
PCT Filed Feb. 16, 2021, PCT No. PCT/KR2021/001953 § 371(c)(1), (2) Date Jul. 7, 2021, PCT Pub. No. WO2021/177628, PCT Pub. Date Sep. 10, 2021.
Claims priority of application No. 2003088 (GB), filed on Mar. 4, 2020; and application No. 20206371 (EP), filed on Nov. 9, 2020.
Prior Publication US 2023/0145150 A1, May 11, 2023
Int. Cl. G06V 40/20 (2022.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01); G06V 10/77 (2022.01)

CPC G06V 40/20 (2022.01) [G06V 10/764 (2022.01); G06V 10/77 (2022.01); G06V 10/82 (2022.01); G06V 20/40 (2022.01)]

15 Claims

1. A method of performing video action recognition, the method comprising:

receiving a video comprising a plurality of frames;

generating a multi-dimensional feature tensor representing the received video, the multi-dimensional feature tensor having a plurality of channels; and

performing action recognition using a machine learning (ML) model comprising a plurality of temporal cross-resolution (TCR) blocks, by:

splitting the multi-dimensional feature tensor into:

a first feature tensor having a first set of channels, and

a second feature tensor having a second set of channels;

applying at least one temporal pooling layer to the second feature tensor to temporally downsample the second feature tensor;

processing, in parallel, using each TCR block among the plurality of TCR blocks:

the first feature tensor, and

the temporally downsampled second feature tensor; and

outputting a prediction of an action within the received video.
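
The data flow recited in claim 1 can be illustrated with a minimal sketch, written here in plain Python with nested lists indexed as [frame][channel]. Everything beyond the claimed steps is an assumption made for illustration: the toy sizes, the even channel split, average pooling with a stride of 2, the function names, the internals of tcr_block_placeholder (the claim does not describe what a TCR block computes), and the classification head that treats each pooled channel as one action score. The claimed parallel processing is modeled only as both streams passing through the same block call, not as actual concurrency.

```python
from statistics import fmean

# Hypothetical toy sizes; the claim does not fix any of these values.
NUM_FRAMES, NUM_CHANNELS = 8, 4


def split_channels(features, first_count):
    """Split a [frame][channel] feature tensor into two channel groups."""
    first = [frame[:first_count] for frame in features]
    second = [frame[first_count:] for frame in features]
    return first, second


def temporal_avg_pool(features, stride=2):
    """Temporally downsample by averaging each run of `stride` frames."""
    pooled = []
    for start in range(0, len(features) - stride + 1, stride):
        window = features[start:start + stride]
        pooled.append([fmean(column) for column in zip(*window)])
    return pooled


def smooth_over_time(stream):
    """Average each frame with its immediate temporal neighbours."""
    smoothed = []
    for t in range(len(stream)):
        window = stream[max(0, t - 1):t + 2]
        smoothed.append([fmean(column) for column in zip(*window)])
    return smoothed


def tcr_block_placeholder(full_res, low_res):
    """Stand-in for a TCR block (assumption, not the patented design).

    It takes the full-resolution stream and the temporally downsampled
    stream together and returns an updated version of each.
    """
    return smooth_over_time(full_res), smooth_over_time(low_res)


def recognise_action(features, num_blocks=2):
    """Run the claimed steps: split, pool, TCR blocks, prediction."""
    first, second = split_channels(features, NUM_CHANNELS // 2)
    second = temporal_avg_pool(second, stride=2)
    for _ in range(num_blocks):
        first, second = tcr_block_placeholder(first, second)
    # Hypothetical head: average each channel over time and treat the
    # result as one score per action class.
    scores = [fmean(column) for column in zip(*first)]
    scores += [fmean(column) for column in zip(*second)]
    prediction = max(range(len(scores)), key=scores.__getitem__)
    return prediction, first, second


# Toy feature tensor standing in for the output of a feature extractor.
video_features = [[float(t + c) for c in range(NUM_CHANNELS)]
                  for t in range(NUM_FRAMES)]
predicted_action, first_stream, second_stream = recognise_action(video_features)
print(predicted_action, len(first_stream), len(second_stream))
```

In this sketch the first stream keeps all 8 frames while the second stream is reduced to 4, so each block sees the same video at two temporal resolutions. How the two resolutions interact inside a real TCR block is not stated in claim 1 and is left out here.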