US 12,327,409 B2
	System and method for determining sub-activities in videos and segmenting the videos with little to no annotation
Quoc-Huy Tran, Redmond, WA (US); Muhammad Zeeshan Zia, Sammamish, WA (US); Andrey Konin, Redmond, WA (US); Sateesh Kumar, La Jolla, CA (US); Sanjay Haresh, Burnaby (CA); Awais Ahmed, Karachi (PK); Hamza Khan, Karachi (PK); and Muhammad Shakeeb Hussain Siddiqui, Karachi (PK)
Assigned to Retrocausal, Inc., WA (US)
Filed by Retrocausal, Inc., Redmond, WA (US)
Filed on May 25, 2022, as Appl. No. 17/752,946.
Claims priority of provisional application 63/192,923, filed on May 25, 2021.
Prior Publication US 2022/0383638 A1, Dec. 1, 2022
Int. Cl. G06V 20/40 (2022.01); G06V 10/62 (2022.01); G06V 10/762 (2022.01)

CPC G06V 20/49 (2022.01) [G06V 10/62 (2022.01); G06V 10/762 (2022.01); G06V 20/41 (2022.01)]

20 Claims

1. A computing system for determining sub-activities in videos and segmenting the videos, the computing system comprising:

one or more hardware processors; and

a memory coupled to the one or more hardware processors, wherein the memory comprises a plurality of modules in the form of programmable instructions executable by the one or more hardware processors, and wherein the plurality of modules comprises:

a data receiver module configured to receive one or more videos from one or more sources for segmenting the one or more videos, wherein the one or more videos are unlabeled videos comprising one or more activities performed by a human;

a batch extraction module configured to extract one or more batches from the received one or more videos by using a batch extraction technique, wherein each of the one or more batches comprises a set of frames;

a feature extraction module configured to extract one or more features from the set of frames associated with each of the one or more batches by using an activity determination-based Machine Learning (ML) model;

a predicted code generation module configured to generate a set of predicted codes based on the extracted one or more features, a temperature parameter, and a set of learned prototypes by using the activity determination-based ML model, wherein each of the set of learned prototypes corresponds to a cluster center;

a cross-entropy loss determination module configured to determine a cross-entropy loss corresponding to one or more parameters associated with the activity determination-based ML model and the set of learned prototypes based on the generated set of predicted codes and a set of pseudo-label codes by using the activity determination-based ML model;

a temporal coherence loss determination module configured to determine a temporal coherence loss corresponding to the one or more parameters based on a subset of frames associated with the set of frames, a positive sample of frames, a negative sample of frames and the one or more parameters by using the activity determination-based ML model upon determining the temporal coherence loss;

a loss determination module configured to determine a final loss based on the determined cross-entropy loss, the determined temporal coherence loss and a weight associated with the determined temporal coherence loss by using the activity determination-based ML model, wherein the final loss is optimized corresponding to the one or more parameters and the set of learned prototypes;

a data categorization module configured to categorize the set of frames into one or more predefined clusters based on the extracted one or more features, the determined final loss, the set of predicted codes and the set of learned prototypes by using the activity determination-based ML model, wherein each of the one or more predefined clusters corresponds to a sub-activity;

a data generation module configured to generate one or more segmented videos based on the categorized set of frames, the determined final loss and the set of predicted codes by using the activity determination-based ML model; and

a data output module configured to output the generated one or more segmented videos on user interface screen of one or more electronic devices associated with one or more users.
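
The loss terms recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that predicted codes are a softmax over feature-prototype dot products divided by the temperature parameter, that the pseudo-label codes are supplied by some earlier step, and that the temporal coherence loss takes a triplet-style hinge form with a hypothetical margin. The claim fixes none of these forms, and every function and parameter name below is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_codes(features, prototypes, temperature):
    """One code (distribution over prototypes) per frame feature.

    Assumption: similarity is a dot product scaled by 1 / temperature.
    """
    codes = []
    for z in features:
        scores = [
            sum(a * b for a, b in zip(z, c)) / temperature
            for c in prototypes
        ]
        codes.append(softmax(scores))
    return codes


def cross_entropy_loss(pred_codes, pseudo_codes, eps=1e-12):
    """Mean cross-entropy between pseudo-label codes and predicted codes."""
    total = 0.0
    for p_row, q_row in zip(pred_codes, pseudo_codes):
        total -= sum(q * math.log(p + eps) for p, q in zip(p_row, q_row))
    return total / len(pred_codes)


def temporal_coherence_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style hinge (an assumed form, not taken from the claim).

    Pulls the anchor frame toward a temporally close (positive) frame and
    pushes it at least `margin` further from a distant (negative) frame.
    """
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


def final_loss(ce_loss, tc_loss, tc_weight):
    """Cross-entropy loss plus the weighted temporal coherence loss."""
    return ce_loss + tc_weight * tc_loss
```

In this sketch, a frame would be assigned to the cluster (sub-activity) whose prototype receives the largest entry in its predicted code. Consecutive frames sharing a cluster would then form one segment.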