US 12,321,866 B2
Data diversity visualization and quantification for machine learning models
Deepa Anand, Bangalore (IN); Rakesh Mullick, Bangalore (IN); Dattesh Dayanand Shanbhag, Bengaluru (IN); and Marc T Edgar, Glenmont, NY (US)
Assigned to GE Precision Healthcare LLC, Waukesha, WI (US)
Filed by GE Precision Healthcare LLC, Milwaukee, WI (US)
Filed on Apr. 28, 2021, as Appl. No. 17/243,046.
Prior Publication US 2022/0351055 A1, Nov. 3, 2022
Int. Cl. G06N 5/04 (2023.01); G06N 20/00 (2019.01)
CPC G06N 5/04 (2013.01) [G06N 20/00 (2019.01)]
20 Claims
 
1. A system, comprising:
a memory that stores computer-executable components; and
a processor that executes at least one of the computer-executable components that:
accesses a first set of data candidates and a second set of data candidates, wherein a machine learning model is trained on the first set of data candidates, wherein the second set of data candidates represents at least one of augmented versions of the first set of data candidates or potential training data for the machine learning model;
obtains a first set of latent activations generated by the machine learning model based on the first set of data candidates;
obtains a second set of latent activations generated by the machine learning model based on the second set of data candidates;
generates a first set of compressed data points in a defined space by applying a dimensionality reduction technique to the first set of latent activations;
generates a second set of compressed data points in the defined space by applying the dimensionality reduction technique to the second set of latent activations;
generates a diversity score based on the first set of compressed data points and the second set of compressed data points, wherein the diversity score represents a degree of diversity of the second set of compressed data points from the first set of compressed data points from a perspective of the machine learning model, and wherein generating the diversity score comprises:
grouping compressed data points of the first set of compressed data points into clusters,
for each cluster, determining a respective threshold distance in the defined space from a center of the cluster based on a density of compressed data points in the cluster and a distribution of compressed data points in the cluster,
assigning compressed data points of the second set of compressed data points into the clusters based on a defined criterion, and
determining the diversity score based on an amount of respective compressed data points of the second set of compressed data points in the defined space that are outside of the respective threshold distances from the respective centers of the clusters assigned to the respective compressed data points of the second set of compressed data points;
recommends training the machine learning model on the second set of data candidates in response to determining that the diversity score satisfies a predetermined threshold; and
recommends not training the machine learning model on the second set of data candidates in response to determining that the diversity score fails to satisfy the predetermined threshold.
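
For illustration only, the steps recited in claim 1 might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The claim does not name specific techniques, so the sketch substitutes PCA for the dimensionality reduction technique, a plain k-means for the grouping step, nearest-center assignment for the "defined criterion", and the mean plus a multiple of the standard deviation of member distances as a stand-in for a threshold distance based on cluster density and distribution. It also assumes the score "satisfies" the predetermined threshold when it is greater than or equal to it. All function names and parameters (fit_pca, kmeans, spread_factor, score_threshold, and so on) are hypothetical.

```python
import numpy as np


def fit_pca(X, n_components=2):
    """Fit a PCA projection (assumed stand-in for the dimensionality reduction)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Map latent activations into the shared low-dimensional ("defined") space."""
    return (X - mean) @ components.T


def pairwise_dist(points, centers):
    """Euclidean distance from every point to every cluster center."""
    return np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means (assumed stand-in for the unspecified grouping step)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = pairwise_dist(points, centers).argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, pairwise_dist(points, centers).argmin(axis=1)


def cluster_thresholds(points, centers, labels, spread_factor=2.0):
    """Per-cluster threshold distance: mean + spread_factor * std (an assumption)."""
    thresholds = np.zeros(len(centers))
    for j, center in enumerate(centers):
        members = points[labels == j]
        if len(members) > 0:
            d = np.linalg.norm(members - center, axis=1)
            thresholds[j] = d.mean() + spread_factor * d.std()
    return thresholds


def diversity_score(first_act, second_act, k=3, n_components=2, spread_factor=2.0):
    """Fraction of second-set points lying outside their assigned cluster's threshold."""
    # The projection is fitted on the first set and reused for the second set,
    # so both sets of compressed points live in the same space.
    mean, comps = fit_pca(first_act, n_components)
    z1 = project(first_act, mean, comps)
    z2 = project(second_act, mean, comps)
    centers, labels = kmeans(z1, k)
    thresholds = cluster_thresholds(z1, centers, labels, spread_factor)
    d2 = pairwise_dist(z2, centers)
    assigned = d2.argmin(axis=1)  # assumed criterion: nearest cluster center
    outside = d2.min(axis=1) > thresholds[assigned]
    return float(outside.mean())


def recommend_training(score, score_threshold=0.5):
    """True means recommend training on the second set of data candidates."""
    return score >= score_threshold
```

In this sketch, first_act and second_act are arrays with one row of latent activations per data candidate, obtained from the same layer of the trained model. A second set that lands inside the existing clusters yields a low score and no recommendation, while a second set that falls mostly outside the per-cluster thresholds yields a score near 1.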