US 12,299,982 B2
Systems and methods for partially supervised online action detection in untrimmed videos
Mingfei Gao, Sunnyvale, CA (US); Yingbo Zhou, Mountain View, CA (US); Ran Xu, Mountain View, CA (US); and Caiming Xiong, Menlo Park, CA (US)
Assigned to Salesforce, Inc., San Francisco, CA (US)
Filed by Salesforce, Inc., San Francisco, CA (US)
Filed on Jul. 16, 2020, as Appl. No. 16/931,228.
Claims priority of provisional application 63/023,402, filed on May 12, 2020.
Prior Publication US 2021/0357687 A1, Nov. 18, 2021
Int. Cl. G06V 20/40 (2022.01); G06F 17/18 (2006.01); G06F 18/2113 (2023.01); G06F 18/214 (2023.01); G06F 18/2431 (2023.01); G06N 3/084 (2023.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/20 (2022.01)
CPC G06V 20/44 (2022.01) [G06F 17/18 (2013.01); G06F 18/2113 (2023.01); G06F 18/214 (2023.01); G06F 18/2431 (2023.01); G06N 3/084 (2013.01); G06V 10/764 (2022.01); G06V 10/82 (2022.01); G06V 20/20 (2022.01); G06V 20/40 (2022.01)] 20 Claims
OG exemplary drawing
 
1. A method of training an online action detection (OAD) neural network model using a training dataset of untrimmed videos having video-level labels without annotated labels indicating whether a specific video frame contains an action start of a specific action class, the method comprising:
receiving, by a communication interface, an input of the training dataset of untrimmed videos including a set of video-level labels indicating one or more action classes that emerge in the untrimmed videos, wherein the untrimmed videos are used for training the OAD neural network model with no annotated label indicating action starts of the one or more action classes;
generating, by a feature extractor neural network model implemented on one or more hardware processors, feature representations from the training dataset of the untrimmed videos;
generating, by a temporal proposal generator (TPG) neural network model implemented on the one or more hardware processors and receiving the feature representations from the feature extractor neural network model, class-wise temporal proposals indicating a respective estimated action start for each action class of the one or more action classes based on the feature representations and the set of video-level labels;
generating, by an online action recognizer (OAR) neural network model implemented on the one or more hardware processors and receiving the feature representations from the feature extractor neural network model and the class-wise temporal proposals from the TPG neural network model, per-frame action scores over the one or more action classes indicating whether each frame of the untrimmed videos contains each specific action class, and a class-agnostic start score indicating whether the respective frame contains a start of any action, based on the feature representations and the class-wise temporal proposals;
training the OAD neural network model comprising the TPG neural network model and the OAR neural network model according to a loss metric computed based on the per-frame action scores and the class-agnostic start scores, using the class-wise temporal proposals as pseudo ground-truth labels; and
generating, by the trained OAD neural network model, a predicted action start for an input real-time video stream.
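The two trainable components recited in claim 1 can be illustrated with a short sketch. The PyTorch code below is a minimal, non-authoritative rendering of the claimed architecture: the layer sizes, the two-layer classifier in the TPG, and the choice of a unidirectional GRU to keep the OAR causal are illustrative assumptions, not the patented implementation. Feature representations are assumed to come from a separate (e.g., frozen) video backbone, per the feature extractor step of the claim.

```python
# Minimal sketch of the claimed TPG/OAR architecture (assumptions noted above).
import torch
import torch.nn as nn

class TPG(nn.Module):
    """Temporal proposal generator: per-frame class-wise scores,
    trained only with video-level labels (weak supervision)."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, feat_dim) -> class activation sequence (T, num_classes)
        return self.classifier(feats)

class OAR(nn.Module):
    """Online action recognizer: causal per-frame action scores plus a
    class-agnostic start score for each frame."""
    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, num_classes + 1)  # +1 = background
        self.start_head = nn.Linear(hidden, 1)  # class-agnostic start score

    def forward(self, feats: torch.Tensor):
        # feats: (B, T, feat_dim); a unidirectional GRU keeps the model
        # online: frame t's scores depend only on frames 1..t.
        h, _ = self.rnn(feats)
        return self.action_head(h), self.start_head(h).squeeze(-1)
```

A unidirectional recurrent layer is one natural way to satisfy the "online" constraint, since no future frames are consumed when scoring frame t.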
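The training step of claim 1 combines weak supervision of the TPG with pseudo-supervision of the OAR. The sketch below assumes one common weakly supervised recipe, top-k temporal pooling plus binary cross-entropy against the video-level labels, and a simple thresholding rule for turning the class-wise proposals into per-frame pseudo ground-truth labels; the pooling size k, the thresholds, and the equal weighting of the loss terms are all hypothetical choices, not taken from the patent.

```python
# Hedged sketch of the joint loss: TPG supervised by video-level labels,
# OAR supervised by pseudo labels derived from the TPG's proposals.
import torch
import torch.nn.functional as F

def tpg_video_loss(cas, video_labels, k=8):
    # cas: (T, C) class activation sequence; video_labels: (C,) multi-hot.
    pooled = torch.topk(cas, k=min(k, cas.size(0)), dim=0).values.mean(dim=0)
    return F.binary_cross_entropy_with_logits(pooled, video_labels)

def proposals_to_pseudo_labels(cas, video_labels, thresh=0.5):
    # Threshold the class-wise proposals into per-frame pseudo labels,
    # keeping only classes present in the video-level labels.
    T, C = cas.shape
    probs = torch.sigmoid(cas) * video_labels          # (T, C)
    is_fg = (probs > thresh).any(dim=1)
    frame_labels = torch.where(is_fg, probs.argmax(dim=1),
                               torch.full((T,), C, device=cas.device))  # C = background
    # Pseudo start label: first frame of each contiguous foreground run.
    prev_bg = torch.cat([torch.ones(1, dtype=torch.bool, device=cas.device),
                         ~is_fg[:-1]])
    start_labels = (is_fg & prev_bg).float()
    return frame_labels, start_labels

def oad_loss(action_logits, start_logits, cas, video_labels):
    # action_logits: (T, C+1); start_logits: (T,); cas: (T, C).
    # Detach the proposals so the pseudo ground truth does not backpropagate.
    frame_labels, start_labels = proposals_to_pseudo_labels(cas.detach(), video_labels)
    return (tpg_video_loss(cas, video_labels)
            + F.cross_entropy(action_logits, frame_labels)
            + F.binary_cross_entropy_with_logits(start_logits, start_labels))
```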
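At inference time the trained model consumes a real-time stream frame by frame, as in the final step of the claim. The sketch below, using the OAR sketched earlier, flags a predicted action start when the per-frame action probability and the class-agnostic start probability both exceed thresholds and the predicted class differs from the previous frame's; the thresholds and the change-of-class heuristic are assumptions for illustration.

```python
# Streaming inference sketch: carry the GRU state across frames so each
# prediction uses only the frames seen so far.
@torch.no_grad()
def detect_starts(model, frame_stream, act_thresh=0.5, start_thresh=0.5):
    starts, h, prev = [], None, None
    for t, feat in enumerate(frame_stream):      # feat: (feat_dim,)
        out, h = model.rnn(feat.view(1, 1, -1), h)
        probs = torch.softmax(model.action_head(out)[0, 0], dim=-1)
        start_p = torch.sigmoid(model.start_head(out)[0, 0, 0])
        c = int(probs[:-1].argmax())             # best non-background class
        if probs[c] > act_thresh and start_p > start_thresh and c != prev:
            starts.append((t, c))                # (frame index, class)
        prev = c if probs[c] > act_thresh else None
    return starts
```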