US 11,921,757 B2
System to label K-means clusters with human understandable labels
Mark Watson, Urbana, IL (US); Reza Farivar, Champaign, IL (US); Austin Walters, Savoy, IL (US); Jeremy Goodsitt, Champaign, IL (US); Vincent Pham, Champaign, IL (US); Anh Truong, Champaign, IL (US); and Galen Rafferty, Mahomet, IL (US)
Assigned to Capital One Services, LLC, McLean, VA (US)
Filed by Capital One Services, LLC, McLean, VA (US)
Filed on Jan. 17, 2023, as Appl. No. 18/097,524.
Application 18/097,524 is a continuation of application No. 15/931,233, filed on May 13, 2020, granted, now 11,556,564.
Prior Publication US 2023/0153330 A1, May 18, 2023
Int. Cl. G06F 7/00 (2006.01); G06F 7/08 (2006.01); G06F 16/28 (2019.01); G06N 5/04 (2023.01); G06N 20/00 (2019.01)
CPC G06F 16/285 (2019.01) [G06F 7/08 (2013.01); G06N 5/04 (2013.01); G06N 20/00 (2019.01)]
18 Claims
 
1. A computer-implemented method, comprising:
retrieving, by at least one computer processor, a plurality of data records from a database, wherein the plurality of data records have been categorized by an unsupervised machine-learning algorithm, and wherein each data record of the plurality of data records has been designated a cluster number out of a total K number of clusters;
for each of a plurality of classification features, performing, by the at least one computer processor, a cluster-based analysis for a first cluster in the K number of clusters to determine proximity of data records between different clusters and overlap with the first cluster, with respect to a single feature, the cluster-based analysis comprising:
determining, by the at least one computer processor, an average and standard deviation, collectively, for all data records forming the first cluster, for the single feature; and
determining, by the at least one computer processor, an average for all data records in each other cluster in the K number of clusters, for the single feature;
generating, by the at least one computer processor, based on the cluster-based analysis, a single feature overlap score for each of the plurality of classification features based on a proximity of the first cluster to the other clusters in the K number of clusters and an amount of overlap of the first cluster with the other clusters in the K number of clusters; and
generating, by the at least one computer processor, a naming label for the first cluster based on a predetermined number of features having lowest overlap scores of the plurality of classification features.
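The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not state a formula for the overlap score, so the one used here is an assumption chosen only for illustration: each other cluster contributes 1 / (1 + z), where z is the distance between that cluster's feature average and the first cluster's average, measured in standard deviations of the first cluster. The function names `feature_overlap_scores` and `name_cluster`, the `eps` guard against a zero standard deviation, and the " / " label separator are likewise hypothetical.

```python
from statistics import mean, pstdev


def feature_overlap_scores(records, cluster_ids, first_cluster, eps=1e-9):
    """Score each feature by how much the first cluster overlaps the others.

    records      -- list of dicts mapping feature name -> numeric value
    cluster_ids  -- cluster number already assigned to each record
    first_cluster -- the cluster to be labeled
    The scoring formula is an illustrative assumption, not taken from the claim.
    """
    features = list(records[0].keys())
    clusters = sorted(set(cluster_ids))
    scores = {}
    for feat in features:
        # Group this single feature's values by cluster number.
        by_cluster = {
            c: [r[feat] for r, cid in zip(records, cluster_ids) if cid == c]
            for c in clusters
        }
        # Average and standard deviation for the first cluster.
        mu = mean(by_cluster[first_cluster])
        sigma = pstdev(by_cluster[first_cluster])
        score = 0.0
        for c in clusters:
            if c == first_cluster:
                continue
            # Average for each other cluster, compared with the first cluster.
            z = abs(mean(by_cluster[c]) - mu) / (sigma + eps)
            score += 1.0 / (1.0 + z)  # near 1 when close, near 0 when far apart
        scores[feat] = score
    return scores


def name_cluster(scores, top_n=2):
    """Build a label from the top_n features with the lowest overlap scores."""
    ranked = sorted(scores, key=scores.get)
    return " / ".join(ranked[:top_n])
```

Under this assumed formula, a feature whose average in the first cluster lies far from every other cluster's average, relative to the first cluster's spread, receives a low score and is therefore preferred for the label.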