US 12,327,398 B2
Generating balanced train-test splits for machine learning
Simona Rabinovici-Cohen, Haifa (IL); Ella Barkan, Haifa (IL); and Tal Tlusty Shapiro, Zichron Yaacov (IL)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Dec. 2, 2022, as Appl. No. 18/061,024.
Prior Publication US 2024/0185575 A1, Jun. 6, 2024
Int. Cl. G06V 10/00 (2022.01); G06N 3/0464 (2023.01); G06V 10/40 (2022.01); G06V 10/762 (2022.01); G06V 10/771 (2022.01); G06V 10/774 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01)
CPC G06V 10/774 (2022.01) [G06N 3/0464 (2023.01); G06V 10/40 (2022.01); G06V 10/762 (2022.01); G06V 10/771 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 2201/03 (2022.01)]
20 Claims
 
1. A computer-based method of generating balanced train-test splits for machine learning analysis, the method comprising:
automatically extracting low-level features and high-level features from a series of received datasets;
automatically determining a series of impactful features for each of the received datasets correlating to a corresponding label;
selecting subsets of impactful features;
automatically clustering the received datasets to generate series of clusters, each of the generated series of clusters corresponding to one of the selected subsets of impactful features;
automatically generating train-test split versions using datasets from each cluster in each of the generated series of clusters;
automatically scoring the generated train-test split versions; and
automatically selecting a highest-scoring train-test split version.
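
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name in it (pearson, impactful_features, cluster_by_subset, split_from_clusters, balance_score, best_split) is hypothetical. The claim does not specify how each step is carried out, so the sketch makes several assumptions: features are already extracted into numeric rows, "impactful" is approximated by absolute Pearson correlation with the label, clustering is replaced by a simple above/below-median grouping, and the score is the negative gap between train and test means.

```python
import itertools
import math
import random
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def impactful_features(rows, labels, top_k):
    """Assumption: rank feature indices by |correlation with label|."""
    n_feat = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: -scores[j])[:top_k]


def cluster_by_subset(rows, subset):
    """Stand-in for clustering: group samples by above/below-median pattern."""
    medians = {j: sorted(r[j] for r in rows)[len(rows) // 2] for j in subset}
    clusters = defaultdict(list)
    for i, r in enumerate(rows):
        clusters[tuple(r[j] >= medians[j] for j in subset)].append(i)
    return list(clusters.values())


def split_from_clusters(clusters, test_frac, rng):
    """Draw roughly test_frac of every cluster into the test set."""
    train, test = [], []
    for members in clusters:
        members = members[:]  # copy so the caller's cluster is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def balance_score(rows, labels, train, test, feats):
    """Assumed score: negative gap between train and test means (0 is best)."""
    if not train or not test:
        return float("-inf")

    def mean(idx, value):
        return sum(value(i) for i in idx) / len(idx)

    gap = abs(mean(train, lambda i: labels[i]) - mean(test, lambda i: labels[i]))
    for j in feats:
        gap += abs(mean(train, lambda i: rows[i][j]) - mean(test, lambda i: rows[i][j]))
    return -gap


def best_split(rows, labels, top_k=2, test_frac=0.25, seeds=(0, 1, 2)):
    """Generate one split version per (feature subset, seed); keep the best."""
    feats = impactful_features(rows, labels, top_k)
    subsets = [s for r in range(1, len(feats) + 1)
               for s in itertools.combinations(feats, r)]
    candidates = []
    for subset in subsets:
        clusters = cluster_by_subset(rows, subset)
        for seed in seeds:
            train, test = split_from_clusters(clusters, test_frac, random.Random(seed))
            candidates.append((balance_score(rows, labels, train, test, feats), train, test))
    return max(candidates, key=lambda c: c[0])
```

Calling best_split(rows, labels) returns a (score, train_indices, test_indices) tuple. Sampling the same fraction from every cluster is what keeps each candidate split representative of the clustered feature space; the scoring step then only has to choose among candidates that are already roughly balanced.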