US 12,254,690 B2
Method for video recognition and related products
Jenhao Hsiao, Palo Alto, CA (US); and Jiawei Chen, Palo Alto, CA (US)
Assigned to GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP., LTD., Guangdong (CN)
Filed by GUANGDONG OPPO MOBILE TELECOMMUNICATIONS CORP., LTD., Guangdong (CN)
Filed on Sep. 15, 2022, as Appl. No. 17/932,360.
Application 17/932,360 is a continuation of application No. PCT/CN2021/083326, filed on Mar. 26, 2021.
Claims priority of provisional application 63/000,011, filed on Mar. 26, 2020.
Prior Publication US 2023/0005264 A1, Jan. 5, 2023
Int. Cl. G06V 20/40 (2022.01); G06V 10/774 (2022.01); G06V 10/82 (2022.01)

CPC G06V 20/40 (2022.01) [G06V 10/774 (2022.01); G06V 10/82 (2022.01); G06V 20/41 (2022.01); G06V 20/44 (2022.01)]

20 Claims

1. A method for video recognition, comprising:

obtaining an original set of clip descriptors by providing a plurality of clips of a video as an input of a three-dimensional (3D) convolutional neural network (CNN) of a neural network, wherein the neural network comprises the 3D CNN and at least one first fully connected layer, and each of the plurality of clips comprises at least one frame;

determining an attention vector corresponding to the original set of clip descriptors;

obtaining an enhanced set of clip descriptors based on the original set of clip descriptors and the attention vector; and

inputting the enhanced set of clip descriptors into the at least one first fully connected layer and performing video recognition based on an output of the at least one first fully connected layer.
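
The steps of claim 1 can be illustrated with a minimal sketch in plain Python. The claim does not say how the attention vector is computed or how the enhanced descriptors are formed, so everything specific here is an assumption made for illustration: `clip_descriptor` is a hypothetical stand-in for the 3D CNN (it only averages frame features), the attention vector is taken to be a softmax over each descriptor's mean activation, enhancement is taken to be scaling each descriptor by its attention weight, and the enhanced descriptors are summed before the fully connected layer. None of these names or choices come from the patent.

```python
import math


def clip_descriptor(clip):
    """Hypothetical stand-in for the 3D CNN: average each feature over the frames."""
    frame_count = len(clip)
    return [sum(frame[i] for frame in clip) / frame_count for i in range(len(clip[0]))]


def attention_vector(descriptors):
    """Assumed scoring rule: softmax over each descriptor's mean activation."""
    scores = [sum(d) / len(d) for d in descriptors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance(descriptors, attention):
    """Assumed enhancement: scale every descriptor by its attention weight."""
    return [[weight * x for x in d] for d, weight in zip(descriptors, attention)]


def fully_connected(vector, weights, bias):
    """One fully connected layer: a matrix-vector product plus a bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def recognize(clips, weights, bias):
    """Original descriptors -> attention vector -> enhanced descriptors -> FC layer."""
    descriptors = [clip_descriptor(clip) for clip in clips]
    attention = attention_vector(descriptors)
    enhanced = enhance(descriptors, attention)
    # Assumed pooling: sum the enhanced descriptors into one video-level vector.
    video_vector = [sum(column) for column in zip(*enhanced)]
    logits = fully_connected(video_vector, weights, bias)
    label = max(range(len(logits)), key=lambda k: logits[k])
    return label, attention


# Toy video: three clips, two frames per clip, two features per frame.
clips = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 3.0], [0.0, 3.0]],
]
label, attention = recognize(clips, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(label, attention)
```

In this toy input the third clip has the largest mean activation, so under the assumed scoring rule it receives the largest attention weight and contributes most to the video-level vector.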

9. A method for training a neural network, comprising:

obtaining an original set of clip descriptors by providing a plurality of clips of a video as an input of a three-dimensional (3D) convolutional neural network (CNN) of a neural network, wherein the neural network comprises the 3D CNN and at least one first fully connected layer, the 3D CNN comprises at least one convolutional layer and at least one second fully connected layer, and each of the plurality of clips comprises at least one frame;

determining an attention vector corresponding to the original set of clip descriptors;

obtaining an enhanced set of clip descriptors based on the original set of clip descriptors and the attention vector;

inputting the enhanced set of clip descriptors into the at least one first fully connected layer and obtaining an output of the neural network; and

training the neural network by updating parameters of the neural network based on a loss of the neural network, wherein the parameters of the neural network comprise a weight of the at least one first fully connected layer and a weight of the at least one second fully connected layer.
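
The training step of claim 9 can be sketched in the same spirit. This is a toy illustration under stated assumptions, not the patented procedure: the loss is assumed to be cross-entropy, the gradients are estimated with central finite differences instead of backpropagation to keep the code short, and the attention and pooling rules are the same illustrative choices as any softmax-weighted sum. The parameter names `fc1` (the first fully connected layer) and `fc2` (the second fully connected layer, treated here as the tail of a stand-in 3D CNN) are hypothetical, and the convolutional layers are replaced by simple frame averaging.

```python
import copy
import math


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(clips, params):
    """Return class probabilities for one video."""
    descriptors = []
    for clip in clips:
        # Stand-in for the convolutional layers: average features over frames.
        pooled = [sum(f[i] for f in clip) / len(clip) for i in range(len(clip[0]))]
        # Second fully connected layer, assumed to close the 3D CNN stand-in.
        descriptors.append(matvec(params["fc2"], pooled))
    attention = softmax([sum(d) / len(d) for d in descriptors])  # assumed rule
    enhanced = [[a * x for x in d] for d, a in zip(descriptors, attention)]
    video_vector = [sum(column) for column in zip(*enhanced)]
    # First fully connected layer produces the class scores.
    return softmax(matvec(params["fc1"], video_vector))


def loss(clips, label, params):
    """Assumed loss: cross-entropy of the true label."""
    return -math.log(forward(clips, params)[label])


def train_step(clips, label, params, lr=0.05, eps=1e-5):
    """One gradient-descent step; gradients come from central finite differences."""
    grads = {}
    for name, matrix in params.items():
        grads[name] = [[0.0] * len(row) for row in matrix]
        for i, row in enumerate(matrix):
            for j in range(len(row)):
                original = row[j]
                row[j] = original + eps
                up = loss(clips, label, params)
                row[j] = original - eps
                down = loss(clips, label, params)
                row[j] = original  # restore before moving on
                grads[name][i][j] = (up - down) / (2 * eps)
    # Update the weights of both fully connected layers.
    for name, matrix in params.items():
        for i, row in enumerate(matrix):
            for j in range(len(row)):
                row[j] -= lr * grads[name][i][j]
    return loss(clips, label, params)


clips = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 3.0], [0.0, 3.0]],
]
params = {
    "fc2": [[1.0, 0.0], [0.0, 1.0]],
    "fc1": [[0.1, -0.1], [-0.1, 0.1]],
}
initial = copy.deepcopy(params)
before = loss(clips, 0, params)
for _ in range(20):
    after = train_step(clips, 0, params)
print(before, after)
```

Finite differences are used only because they avoid hand-written derivatives through the attention step; a practical implementation would rely on automatic differentiation.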