US 12,216,684 B2
Cluster-based dataset evolution
Nathan Wolfe, Silver Spring, MD (US); Purva Shanker, Arlington, VA (US); Joshua Edwards, Philadelphia, PA (US); and Gang Mei, Ellicott City, MD (US)
Assigned to Capital One Services, LLC, McLean, VA (US)
Filed by Capital One Services, LLC, McLean, VA (US)
Filed on Aug. 19, 2022, as Appl. No. 17/821,030.
Prior Publication US 2024/0061867 A1, Feb. 22, 2024
Int. Cl. G06F 16/28 (2019.01); G06N 20/20 (2019.01)
CPC G06F 16/285 (2019.01) [G06N 20/20 (2019.01)]
19 Claims
1. A system for evolving datasets to reduce time-based labeling deterioration based on distances between records of clusters, the system comprising a computer system comprising:
one or more processors programmed with computer program instructions that, when executed, cause the computer system to perform operations comprising:
obtaining a first dataset used to train a first machine learning model configured to generate class predictions from a first set of classes, wherein each record of the first dataset is labeled with at least one class of the first set of classes, the first dataset having a first data quality score for the first set of classes that satisfies a data quality threshold;
obtaining a second dataset comprising shared features, the shared features being shared with the first dataset, wherein each record of the second dataset is labeled with at least one class of a second set of classes, at least one record of the second dataset being labeled with a new class included in the second set of classes and not included in the first set of classes;
creating, from the first dataset and the second dataset, an aggregated dataset for training a second machine learning model to be configured to generate class predictions for the second set of classes, the aggregated dataset having a second data quality score for the second set of classes that satisfy the data quality threshold, wherein creating the aggregated dataset from the first dataset and the second dataset comprises:
determining a first set of clusters of records in the first dataset, wherein each cluster of records of the first set of clusters is labeled with a respective class of the first set of classes;
determining a second set of clusters of records in the second dataset based on the second dataset, wherein each cluster of the second set of clusters is labeled with a respective class of the second set of classes;
determining a set of clustering analysis scores based on distances between records of a first cluster of the first set of clusters and records of a second cluster of the second set of clusters in a feature space of the shared features, wherein the first cluster is associated with a first class, and wherein the second cluster is associated with a second class; and
in response to a determination that the set of clustering analysis scores satisfies a class update threshold, generating a relabeling indication for a set of records of the first dataset associated with the set of clustering analysis scores;
generating a synthesized class associated with the second class; and
updating the set of records indicated with the relabeling indication by labeling the set of records with the synthesized class.
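
The relabeling steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several choices are assumptions made only for illustration: clusters are formed by grouping records on their existing class labels, each clustering analysis score is the mean Euclidean distance from a record of the first cluster to the records of the second cluster, a score "satisfies" the class update threshold when it is at or below that threshold, and the synthesized class is named by prefixing the second class. The function names, the class names, and the threshold value are all hypothetical.

```python
import math
from collections import defaultdict


def group_by_class(records, labels):
    """Group (index, features) pairs by class label.

    Assumption: grouping on labels stands in for the clustering step.
    """
    clusters = defaultdict(list)
    for index, (features, label) in enumerate(zip(records, labels)):
        clusters[label].append((index, features))
    return clusters


def clustering_analysis_scores(first_cluster, second_cluster):
    """Map each first-cluster record index to its mean distance to the second cluster."""
    if not second_cluster:
        return {}
    scores = {}
    for index, features in first_cluster:
        distances = [math.dist(features, other) for _, other in second_cluster]
        scores[index] = sum(distances) / len(distances)
    return scores


def relabel_first_dataset(first_records, first_labels, second_records, second_labels,
                          first_class, second_class, class_update_threshold):
    """Return updated first-dataset labels, the flagged indices, and the synthesized class."""
    first_cluster = group_by_class(first_records, first_labels).get(first_class, [])
    second_cluster = group_by_class(second_records, second_labels).get(second_class, [])
    scores = clustering_analysis_scores(first_cluster, second_cluster)

    # Relabeling indication: records whose score is at or below the threshold
    # (assumed meaning of "satisfies the class update threshold").
    flagged = sorted(i for i, score in scores.items() if score <= class_update_threshold)

    # Hypothetical naming scheme for a synthesized class tied to the second class.
    synthesized_class = f"synthesized:{second_class}"

    updated = list(first_labels)
    for i in flagged:
        updated[i] = synthesized_class
    return updated, flagged, synthesized_class


# Toy data over two shared features; "subscription" is the new class.
first_records = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
first_labels = ["payment", "payment", "travel"]
second_records = [(3.0, 4.0), (3.0, 4.0)]
second_labels = ["subscription", "subscription"]

updated, flagged, synthesized = relabel_first_dataset(
    first_records, first_labels, second_records, second_labels,
    first_class="payment", second_class="subscription",
    class_update_threshold=1.0,
)
print(updated)  # ['payment', 'synthesized:subscription', 'travel']
```

In this toy run, only the "payment" record located at the same point as the "subscription" cluster is flagged; the other "payment" record lies 5.0 away and keeps its original label. The aggregated dataset of the claim would then combine the updated first dataset with the second dataset, a step this sketch does not cover.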