US 11,921,820 B2
Real-time minimal vector labeling scheme for supervised machine learning
Sameer T. Khanna, Cupertino, CA (US)
Assigned to Fortinet, Inc., Sunnyvale, CA (US)
Filed by Fortinet, Inc., Sunnyvale, CA (US)
Filed on Sep. 11, 2020, as Appl. No. 17/018,885.
Prior Publication US 2022/0083815 A1, Mar. 17, 2022
Int. Cl. G06F 18/214 (2023.01); G06F 18/10 (2023.01); G06F 18/2113 (2023.01); G06F 18/2115 (2023.01); G06F 18/213 (2023.01); G06F 18/22 (2023.01); G06F 18/23 (2023.01); G06F 18/2321 (2023.01); G06F 18/2413 (2023.01); G06F 18/2431 (2023.01); G06F 18/28 (2023.01); G06N 3/09 (2023.01); G06N 3/092 (2023.01); G06N 5/01 (2023.01); G06N 20/00 (2019.01)

CPC G06F 18/2155 (2023.01) [G06F 18/10 (2023.01); G06F 18/2113 (2023.01); G06F 18/2115 (2023.01); G06F 18/213 (2023.01); G06F 18/22 (2023.01); G06F 18/23 (2023.01); G06F 18/2321 (2023.01); G06F 18/24137 (2023.01); G06F 18/2431 (2023.01); G06F 18/28 (2023.01); G06N 3/09 (2023.01); G06N 5/01 (2023.01); G06N 20/00 (2019.01); G06N 3/092 (2023.01)]

18 Claims

8. A system comprising:

a processing resource; and

a non-transitory computer-readable medium, coupled to the processing resource, having stored therein instructions that when executed by the processing resource cause the processing resource to:

receive a first set of feature vectors, wherein the first set of feature vectors are un-labeled;

group the first set of feature vectors into a plurality of clusters within a vector space having fewer dimensions than the first set of feature vectors by applying a homomorphic dimensionality reduction algorithm to the first set of feature vectors and performing centroid-based clustering;

identify an optimal set of clusters among the plurality of clusters by performing a convex optimization process on the plurality of clusters;

minimize vector labeling by selecting a plurality of ground truth representative vectors including a representative vector from each cluster of the optimal set of clusters;

create a set of labeled feature vectors based on labels received from an oracle for each of the plurality of representative vectors;

train a machine-learning model for multiclass classification based on the set of labeled feature vectors; and

train the machine-learning model with inductive learning, wherein the inductive learning comprises:

selecting an unlabeled feature vector from the first set of feature vectors;

classifying the un-labeled feature vector using the machine learning model to get a model classified cluster with a confidence score;

determining whether the confidence score is greater than a threshold; and

when said determining is affirmative:

determining a Mahalanobis distance of the un-labeled feature vector with respect to each labeled feature vector of the first set of feature vectors;

determining a statistically matching cluster of labeled feature vectors to which the un-labeled feature vector is closest based on the determined Mahalanobis distance;

determining whether the model classified cluster and the statistically matching cluster are the same; and

when the model classified cluster and the statistically matching cluster are determined to be the same:

labeling the un-labeled feature vector based on the label associated with the model classified cluster; and

model fitting the machine learning model based on the labeling.
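
The inductive learning steps recited in claim 8 (confidence gate, Mahalanobis distance to each labeled feature vector, agreement check between the model classified cluster and the statistically matching cluster) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation. The function names, the 0.9 default threshold, the use of NumPy, the covariance estimated from the labeled vectors, and the rule that the statistically matching cluster is the cluster of the single nearest labeled vector are all hypothetical choices that the claim does not specify.

```python
import numpy as np


def mahalanobis_to_each(x, labeled_X, cov_inv):
    """Mahalanobis distance from vector x to every row of labeled_X."""
    diff = labeled_X - x  # shape (n_labeled, n_features)
    # Quadratic form diff_i^T * cov_inv * diff_i, evaluated for each row i.
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negative round-off


def try_pseudo_label(x, model_label, confidence, labeled_X, labeled_y,
                     threshold=0.9):
    """Return a label for x if both checks pass, otherwise None.

    Assumptions (not taken from the claim): the covariance is estimated
    from the labeled vectors, and the statistically matching cluster is
    the cluster of the nearest labeled vector.
    """
    x = np.asarray(x, dtype=float)
    labeled_X = np.asarray(labeled_X, dtype=float)

    # Check 1: the model's confidence score must exceed the threshold.
    if confidence <= threshold:
        return None

    # Pseudo-inverse keeps the sketch usable with a singular covariance.
    cov_inv = np.linalg.pinv(np.cov(labeled_X, rowvar=False))
    dists = mahalanobis_to_each(x, labeled_X, cov_inv)
    statistical_label = labeled_y[int(np.argmin(dists))]

    # Check 2: the model classified cluster and the statistically
    # matching cluster must be the same.
    if statistical_label != model_label:
        return None
    return model_label


# Hypothetical toy data: two labeled clusters in a 2-D vector space.
labeled_X = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
             [5.0, 5.0], [5.4, 5.1], [5.2, 5.7]]
labeled_y = [0, 0, 0, 1, 1, 1]
candidate = [0.2, 0.1]
accepted = try_pseudo_label(candidate, 0, 0.95, labeled_X, labeled_y)
```

In this sketch a non-None result means the vector would be added to the labeled set under the returned label before the model is fitted again. The refitting step itself is left to the caller because it depends on the model in use.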