| CPC G06F 16/285 (2019.01) [G06N 20/20 (2019.01)] | 19 Claims |

|
1. A system for evolving datasets to reduce time-based labeling deterioration based on distances between records of clusters, the system comprising a computer system comprising:
one or more processors programmed with computer program instructions that, when executed, cause the computer system to perform operations comprising:
obtaining a first dataset used to train a first machine learning model configured to generate class predictions from a first set of classes, wherein each record of the first dataset is labeled with at least one class of the first set of classes, the first dataset having a first data quality score for the first set of classes that satisfies a data quality threshold;
obtaining a second dataset comprising shared features, the shared features being shared with the first dataset, wherein each record of the second dataset is labeled with at least one class of a second set of classes, at least one record of the second dataset being labeled with a new class included in the second set of classes and not included in the first set of classes;
creating, from the first dataset and the second dataset, an aggregated dataset for training a second machine learning model to be configured to generate class predictions for the second set of classes, the aggregated dataset having a second data quality score for the second set of classes that satisfies the data quality threshold, wherein creating the aggregated dataset from the first dataset and the second dataset comprises:
determining a first set of clusters of records in the first dataset, wherein each cluster of records of the first set of clusters is labeled with a respective class of the first set of classes;
determining a second set of clusters of records in the second dataset based on the second dataset, wherein each cluster of the second set of clusters is labeled with a respective class of the second set of classes;
determining a set of clustering analysis scores based on distances between records of a first cluster of the first set of clusters and records of a second cluster of the second set of clusters in a feature space of the shared features, wherein the first cluster is associated with a first class, and wherein the second cluster is associated with a second class; and
in response to a determination that the set of clustering analysis scores satisfies a class update threshold, generating a relabeling indication for a set of records of the first dataset associated with the set of clustering analysis scores;
generating a synthesized class associated with the second class; and
updating the set of records indicated with the relabeling indication by labeling the set of records with the synthesized class.
|
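
The relabeling steps recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation. The claim does not define how clusters are determined, how a clustering analysis score is computed, or what it means to satisfy the class update threshold, so the sketch assumes that each cluster is simply the group of records sharing a class label, that a record's score is its mean Euclidean distance to the records of the second cluster, and that a score satisfies the threshold when it is at or below it. All function names, the `_synth` suffix, and the sample data are hypothetical.

```python
from math import dist
from statistics import mean


def group_by_label(labels):
    """Map each class label to the indices of its records.

    Assumption: each label group stands in for one labeled cluster.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    return clusters


def clustering_scores(first_records, first_idx, second_records, second_idx):
    """Score each record of the first cluster by its mean Euclidean
    distance to the records of the second cluster (shared feature space)."""
    return {
        i: mean(dist(first_records[i], second_records[j]) for j in second_idx)
        for i in first_idx
    }


def relabel(first_records, first_labels, second_records, second_labels,
            first_class, second_class, class_update_threshold):
    """Return updated labels for the first dataset and the flagged indices."""
    first_clusters = group_by_label(first_labels)
    second_clusters = group_by_label(second_labels)
    scores = clustering_scores(
        first_records, first_clusters[first_class],
        second_records, second_clusters[second_class],
    )
    # Assumption: a score "satisfies" the threshold when it is at or below it.
    flagged = sorted(i for i, s in scores.items() if s <= class_update_threshold)
    # Hypothetical naming scheme for the synthesized class.
    synthesized_class = f"{second_class}_synth"
    updated = list(first_labels)
    for i in flagged:
        updated[i] = synthesized_class
    return updated, flagged


# Hypothetical sample data: two shared features per record.
first_records = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
first_labels = ["a", "a", "b", "b"]
second_records = [(5.0, 5.1), (5.2, 5.0), (0.0, 0.1)]
second_labels = ["c", "c", "a"]

updated, flagged = relabel(
    first_records, first_labels, second_records, second_labels,
    first_class="b", second_class="c", class_update_threshold=0.5,
)
print(flagged)
print(updated)
```

In this sample, the records labeled "b" in the first dataset lie close to the records of the new class "c" in the second dataset, so they receive the relabeling indication and are updated to the synthesized class, while the records labeled "a" are left unchanged.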