US 11,797,516 B2
Dataset balancing via quality-controlled sample generation
Naama Tepper, Koranit (IL); Esther Goldbraich, Haifa (IL); Boaz Carmeli, Koranit (IL); Naama Zwerdling, Haifa (IL); George Kour, Tel Aviv (IL); and Ateret Anaby Tavor, Givat Ada (IL)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on May 12, 2021, as Appl. No. 17/317,922.
Prior Publication US 2022/0374410 A1, Nov. 24, 2022
Int. Cl. G06F 16/23 (2019.01); G06N 20/00 (2019.01)
CPC G06F 16/2365 (2019.01) [G06N 20/00 (2019.01)]
20 Claims
 
1. A computer-implemented method comprising:
receiving a balancing policy and an imbalanced dataset that comprises samples distributed unequally between different classes;
automatically performing initial adjustment of the imbalanced dataset to comply with the balancing policy, by:
oversampling one or more of the classes which are underrepresented in the imbalanced dataset, and
based on one or more of the classes being overrepresented in the imbalanced dataset, undersampling the one or more overrepresented classes;
operating a generative machine learning model to generate samples for the one or more underrepresented classes, based on the initially-adjusted dataset;
operating a machine learning classification model to label the generated samples with class labels corresponding to the one or more underrepresented classes;
selecting some of the generated samples which, according to the labelling, have a relatively high probability of preserving their class labels, compared to other ones of the generated samples; and
composing a balanced dataset which complies with the balancing policy and comprises:
the samples from the imbalanced dataset belonging to the one or more underrepresented classes,
the selected generated samples, and
based on one or more of the classes being overrepresented in the imbalanced dataset, undersampling the samples belonging to the one or more overrepresented classes in the imbalanced dataset.
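The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several simplifications are assumed for illustration only: samples are single floats, the balancing policy is a single hypothetical `target` count per class, the generative model is a stand-in that adds Gaussian noise to samples drawn from the initially-adjusted dataset, and the classification model is a stand-in nearest-centroid scorer. The names `group_by_class`, `initial_adjust`, `balance`, `make_centroid_classifier`, and `oversupply` are all hypothetical.

```python
import math
import random
from collections import defaultdict


def group_by_class(dataset):
    """Map each class label to the list of its feature values."""
    groups = defaultdict(list)
    for x, label in dataset:
        groups[label].append(x)
    return dict(groups)


def initial_adjust(groups, target, rng):
    """Randomly oversample small classes and undersample large ones to `target`."""
    adjusted = {}
    for label, xs in groups.items():
        if len(xs) < target:
            # Oversample with replacement up to the target size.
            adjusted[label] = xs + rng.choices(xs, k=target - len(xs))
        else:
            # Undersample without replacement down to the target size.
            adjusted[label] = rng.sample(xs, target)
    return adjusted


def make_centroid_classifier(groups):
    """Stand-in classifier (assumption): class probabilities from centroid distance."""
    centroids = {label: sum(xs) / len(xs) for label, xs in groups.items()}

    def classify(x):
        scores = {label: math.exp(-abs(x - c)) for label, c in centroids.items()}
        total = sum(scores.values())
        return {label: s / total for label, s in scores.items()}

    return classify


def balance(dataset, target, generate, classify, rng, oversupply=3):
    """Compose a dataset with `target` samples per class (assumed policy)."""
    groups = group_by_class(dataset)
    adjusted = initial_adjust(groups, target, rng)
    balanced = []
    for label, xs in groups.items():
        if len(xs) >= target:
            # Overrepresented (or exactly on target): keep an undersampled subset.
            balanced += [(x, label) for x in rng.sample(xs, target)]
            continue
        needed = target - len(xs)
        # Generate more candidates than needed, based on the adjusted dataset.
        candidates = generate(label, adjusted, needed * oversupply)
        # Keep the candidates the classifier most confidently assigns to `label`.
        ranked = sorted(candidates, key=lambda x: classify(x)[label], reverse=True)
        balanced += [(x, label) for x in xs]                # original samples
        balanced += [(x, label) for x in ranked[:needed]]  # selected generated samples
    return balanced


rng = random.Random(0)
data = [(rng.gauss(0, 1), "major") for _ in range(40)]
data += [(rng.gauss(10, 1), "minor") for _ in range(5)]


def toy_generate(label, adjusted, n):
    """Stand-in generator (assumption): jitter samples from the adjusted class."""
    return [rng.choice(adjusted[label]) + rng.gauss(0, 1) for _ in range(n)]


classifier = make_centroid_classifier(group_by_class(data))
balanced = balance(data, target=20, generate=toy_generate,
                   classify=classifier, rng=rng)
counts = {label: len(xs) for label, xs in group_by_class(balanced).items()}
print(counts)
```

In this sketch the quality control lies in the ranking step: candidates are generated in surplus (the `oversupply` factor), and only those to which the classifier assigns the highest probability for the intended class are kept, alongside all original minority samples and an undersampled subset of the majority class.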