US 12,430,899 B2
De-biasing datasets for machine learning
Nikita Jaipuria, Pittsburgh, PA (US); Xianling Zhang, San Jose, CA (US); Katherine Stevo, Wellesley, MA (US); Jinesh Jain, San Francisco, CA (US); Vidya Nariyambut Murali, Sunnyvale, CA (US); and Meghana Laxmidhar Gaopande, Sunnyvale, CA (US)
Assigned to Ford Global Technologies, LLC, Dearborn, MI (US)
Filed by Ford Global Technologies, LLC, Dearborn, MI (US)
Filed on Aug. 3, 2022, as Appl. No. 17/817,235.
Claims priority of provisional application 63/234,763, filed on Aug. 19, 2021.
Prior Publication US 2024/0046625 A1, Feb. 8, 2024
Int. Cl. G06V 10/778 (2022.01); G06V 10/77 (2022.01)
CPC G06V 10/778 (2022.01) [G06V 10/7715 (2022.01)]
15 Claims
 
1. A computer comprising a processor and a memory storing instructions executable by the processor to:
receive a dataset of images;
extract feature data from the images;
optimize a number of clusters into which the images are classified based on the feature data, wherein optimizing the number of clusters includes performing k-means clustering for a plurality of values for the number of clusters, determining the silhouette score for each of the values, and selecting one of the values based on the silhouette scores;
for each cluster, determine a perceptual-similarity score between each pair of the images in that cluster;
for each cluster, reduce a dimensionality of the perceptual-similarity scores for that cluster, wherein, for each cluster, reducing the dimensionality of the perceptual-similarity scores includes performing principal component analysis;
for each cluster, optimize a number of subclusters into which the images in that cluster are classified, wherein, for each cluster, optimizing the number of subclusters in that cluster is based on the perceptual-similarity scores for that cluster after reducing the dimensionality;
in response to optimizing the number of the clusters and the numbers of the subclusters, determine a metric indicating a bias of the dataset toward at least one of the clusters or subclusters based on the number of clusters, the numbers of subclusters, distances between the respective clusters, and distances between the respective subclusters, wherein the bias toward the at least one of the clusters or subclusters indicates overrepresentation of the at least one of the clusters or subclusters in the dataset of the images; and
after determining the metric, train a machine-learning program using a training set constructed from the clusters and the subclusters.
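
As an illustration of the cluster-count optimization recited in claim 1 (k-means over several candidate values, a silhouette score per value, then selection of one value), the following is a minimal sketch in plain Python. It is not the patented implementation. The function names, the farthest-first initialization, the use of Euclidean distance on small 2-D feature vectors, and the rule "pick the highest mean silhouette score" are all assumptions made for the example; the claim only requires that the selection be based on the silhouette scores. Feature extraction, perceptual-similarity scoring, principal component analysis, and the bias metric are not shown.

```python
import math


def farthest_first_centers(points, k):
    """Deterministic initialization (an assumption, not from the claim):
    start at the first point, then repeatedly add the point farthest
    from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(iters):
        # Assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for one cluster; treated here as worst case
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def select_num_clusters(points, candidates):
    """Score every candidate k and return the best k plus all scores.
    Choosing the maximum score is an assumed selection rule."""
    scores = {k: silhouette_score(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical 2-D "feature data": three well-separated groups of four points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (0, 10), (0, 11), (1, 10), (1, 11),
]
best_k, scores = select_num_clusters(points, candidates=[2, 3, 4])
print(best_k, scores)
```

Under the same assumptions, the subcluster counts in the claim could be chosen by calling `select_num_clusters` once per cluster, passing that cluster's dimensionality-reduced perceptual-similarity data in place of `points`.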