US 12,190,215 B1
Automatically selecting relevant data based on user specified data and machine learning characteristics for data integration
Nitin Gupta, Saharanpur (IN); Shashank Mujumdar, Nagpur (IN); Ruhi Sharma Mittal, Bengaluru (IN); and Hima Patel, Bengaluru (IN)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Oct. 25, 2023, as Appl. No. 18/494,345.
Int. Cl. G06F 16/00 (2019.01); G06F 16/2457 (2019.01); G06N 20/00 (2019.01)
CPC G06N 20/00 (2019.01) [G06F 16/24573 (2019.01); G06F 16/24578 (2019.01)]
20 Claims
 
1. A computer-implemented method of automatically selecting data for machine learning datasets, the method comprising:
using a number of processors to perform:
receiving an input dataset;
receiving user-specified data quality metrics;
matching the input dataset to a first subset of candidate datasets in a repository according to schema characteristics;
selecting a second subset of candidate datasets from the first subset of candidate datasets having a distance from the input dataset above a specified threshold;
merging the second subset of candidate datasets into a merged dataset;
identifying top ranked samples from the merged dataset above a specified second threshold based on the user-specified data quality metrics, wherein the identification of top ranked samples from the merged dataset comprises:
feeding samples from the merged dataset into a reinforcement machine learning model that outputs a probability of selecting each sample for a final dataset;
selecting samples from among the samples that have a probability of being selected greater than a specified fourth threshold; and
augmenting the input dataset with the selected samples; and
returning the input dataset augmented with the top ranked samples to a user.
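
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it is hypothetical. The claim does not say how schema matching or dataset distance is computed, so the sketch assumes Jaccard overlap of column names and the absolute difference of a numeric column's mean. A logistic function over a user-supplied quality score stands in for the reinforcement machine learning model, which is not reproduced here.

```python
import math

def schema_match(input_rows, candidates, min_overlap=0.5):
    """First subset: candidates whose column names overlap the input's.
    Jaccard overlap is an assumed measure; datasets are assumed non-empty."""
    cols = set(input_rows[0])
    matched = {}
    for name, rows in candidates.items():
        cand_cols = set(rows[0])
        if len(cols & cand_cols) / len(cols | cand_cols) >= min_overlap:
            matched[name] = rows
    return matched

def mean_distance(a_rows, b_rows, column):
    """Assumed distance: absolute difference of one numeric column's mean."""
    def mean(rows):
        return sum(r[column] for r in rows) / len(rows)
    return abs(mean(a_rows) - mean(b_rows))

def selection_probability(row, quality_metric):
    """Placeholder for the learned model: squash a quality score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-quality_metric(row)))

def augment(input_rows, candidates, quality_metric, column,
            distance_threshold, probability_threshold):
    first = schema_match(input_rows, candidates)
    # Second subset: candidates farther from the input than the threshold.
    second = [
        rows for rows in first.values()
        if column in rows[0]
        and mean_distance(input_rows, rows, column) > distance_threshold
    ]
    merged = [row for rows in second for row in rows]
    selected = [
        row for row in merged
        if selection_probability(row, quality_metric) > probability_threshold
    ]
    return input_rows + selected

# Hypothetical data: one candidate is too close, one has a different schema.
input_rows = [{"x": 1.0, "label": 0}, {"x": 2.0, "label": 1}]
candidates = {
    "near": [{"x": 1.5, "label": 0}],
    "far": [{"x": 9.0, "label": 1}, {"x": 11.0, "label": None}],
    "other_schema": [{"price": 3.0, "sku": "a"}],
}

def completeness(row):
    """Example user-specified quality metric: penalize missing values."""
    return 1.0 if all(v is not None for v in row.values()) else -1.0

augmented = augment(input_rows, candidates, completeness, column="x",
                    distance_threshold=1.0, probability_threshold=0.5)
print(augmented)
```

In this sketch only the complete row from the "far" candidate is appended to the input. The "near" candidate fails the distance test, "other_schema" fails the schema match, and the row with a missing label falls below the probability threshold.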