CPC G06V 10/774 (2022.01) [G06N 3/0464 (2023.01); G06V 10/40 (2022.01); G06V 10/762 (2022.01); G06V 10/771 (2022.01); G06V 10/776 (2022.01); G06V 10/82 (2022.01); G06V 2201/03 (2022.01)] | 20 Claims |
1. A computer-based method of generating balanced train-test splits for machine learning analysis, the method comprising:
automatically extracting low-level features and high-level features from a series of received datasets;
automatically determining a series of impactful features for each of the received datasets correlating to a corresponding label;
selecting subsets of impactful features;
automatically clustering the received datasets to generate series of clusters, each of the generated series of clusters corresponding to one of the selected subsets of impactful features;
automatically generating train-test split versions using datasets from each cluster in each of the generated series of clusters;
automatically scoring the generated train-test split versions; and
automatically selecting a highest-scoring train-test split version.
|
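The steps recited in claim 1 can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the patented implementation: the claim does not specify the feature extractors, the impact measure, the clustering algorithm, or the scoring function. Here the features are simple summary statistics, impact is the absolute Pearson correlation with the label, clustering is a small deterministic k-means, and the score is the label balance between the train and test sets. All function names and parameters (`extract_features`, `rank_features`, `kmeans`, `make_split`, `score_split`, `best_split`, `k`, `test_frac`, `n_versions`) are hypothetical.

```python
import random
import statistics

def extract_features(sample):
    # Assumed stand-ins: mean and spread as "low-level" features,
    # value range as a placeholder for a "high-level" feature.
    return [statistics.fmean(sample), statistics.pstdev(sample),
            max(sample) - min(sample)]

def correlation(xs, ys):
    # Pearson correlation; returns 0.0 when either input is constant.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def rank_features(features, labels):
    # Order feature indices by assumed "impact": |correlation with label|.
    n_feat = len(features[0])
    impact = [abs(correlation([f[j] for f in features], labels))
              for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: -impact[j])

def kmeans(points, k, iters=20):
    # Small k-means with deterministic, evenly spaced initial centers.
    # Features are not rescaled here; a real pipeline likely would.
    centers = [points[i * len(points) // k] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [statistics.fmean(col) for col in zip(*members)]
    return assign

def make_split(assign, test_frac, rng):
    # Draw test items from every cluster so each cluster is represented.
    train, test = [], []
    for c in sorted(set(assign)):
        members = [i for i, a in enumerate(assign) if a == c]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_frac)) if len(members) > 1 else 0
        test += members[:n_test]
        train += members[n_test:]
    return train, test

def score_split(train, test, labels):
    # Assumed score: closer label means in train and test is better (max 0).
    if not train or not test:
        return float("-inf")
    train_mean = statistics.fmean([labels[i] for i in train])
    test_mean = statistics.fmean([labels[i] for i in test])
    return -abs(train_mean - test_mean)

def best_split(samples, labels, k=2, test_frac=0.25, n_versions=5, seed=0):
    features = [extract_features(s) for s in samples]
    ranked = rank_features(features, labels)
    # Feature subsets: the top-1, top-2, ... most impactful features.
    subsets = [ranked[:m] for m in range(1, len(ranked) + 1)]
    rng = random.Random(seed)
    candidates = []
    for subset in subsets:
        points = [[f[j] for j in subset] for f in features]
        assign = kmeans(points, k)  # one clustering per feature subset
        for _ in range(n_versions):
            train, test = make_split(assign, test_frac, rng)
            candidates.append((score_split(train, test, labels), train, test))
    # Select the highest-scoring train-test split version.
    return max(candidates, key=lambda cand: cand[0])
```

Calling `best_split(samples, labels)` with a list of numeric samples and a parallel list of numeric labels returns a `(score, train_indices, test_indices)` tuple. Sampling from every cluster is one way to keep both sets representative of the feature space, and scoring several candidate versions lets the most balanced one be chosen.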