US 11,914,629 B2
Practical supervised classification of data sets
Arunav Mishra, Ludwigshafen (DE); Henning Schwabe, Ludwigshafen (DE); and Lalita Shaki Uribe Ordonez, Ludwigshafen (DE)
Assigned to BASF SE, Ludwigshafen (DE)
Filed by BASF SE, Ludwigshafen (DE)
Filed on Aug. 5, 2021, as Appl. No. 17/394,994.
Claims priority of application No. 20190061 (EP), filed on Aug. 7, 2020.
Prior Publication US 2022/0043850 A1, Feb. 10, 2022
Int. Cl. G06F 16/35 (2019.01); G06F 16/338 (2019.01); G06N 20/00 (2019.01)
CPC G06F 16/35 (2019.01) [G06F 16/338 (2019.01); G06N 20/00 (2019.01)]
9 Claims
 
1. A computer-implemented method for training a classifier model for data classification, in particular in response to a search query, comprising:
a) obtaining a dataset that comprises a seed set of labeled data representing a training dataset;
b) training the classifier model by using the training dataset to fit parameters of the classifier model;
c) evaluating a quality of the classifier model using a test dataset that comprises unlabeled data from the obtained dataset to generate a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset;
d) determining a global risk value of misclassification and a reward value based on the classifier confidence score on the test dataset;
e) iteratively updating the parameters of the classifier model and performing steps b) to d) until the global risk value falls within a predetermined risk limit value or an expected reward value is reached to obtain a trained classifier model for data classification,
wherein step d) further comprises:
d1) generating a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset;
d2) computing a classifier metric at different thresholds on classifier confidence score, the classifier metrics representing a measure of a test's accuracy;
d3) determining a reference threshold that corresponds to a peak in a distribution of the classifier metric over the threshold on classifier confidence score;
d4) determining a threshold range that defines a recommended window according to a predefined criteria, wherein the reference threshold is located within the threshold range; and
d5) computing the reward value at different thresholds on classifier confidence score, and
wherein the reward value includes at least one of a measure of information gain and a measure of decrease in uncertainty.
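
Illustrative sketch (not part of the claim): steps d2) to d5) can be pictured as a sweep over confidence thresholds. The Python below is a minimal example under several assumptions that the claim does not state: the classifier metric is taken to be an F1 score, reference labels are assumed to be available for the scored items so that the metric can be computed, the "predefined criteria" for the recommended window is assumed to be "metric within a fixed tolerance of its peak", and the reward is assumed to be a decrease in uncertainty measured as one bit minus the binary entropy of each accepted item's confidence score. All function names and the `tolerance` parameter are hypothetical.

```python
import math


def f1_at_threshold(scores, labels, t):
    """F1 score when items with confidence >= t are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binary_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def reward_at_threshold(scores, t):
    """Assumed reward: summed decrease in uncertainty (bits) over accepted items."""
    return sum(1.0 - binary_entropy(s) for s in scores if s >= t)


def sweep_thresholds(scores, labels, thresholds, tolerance=0.05):
    """Return (reference threshold, recommended window, reward per threshold)."""
    # d2) classifier metric at each threshold on the confidence score
    metric = [f1_at_threshold(scores, labels, t) for t in thresholds]

    # d3) reference threshold at the peak of the metric (first peak on ties)
    peak = max(range(len(thresholds)), key=lambda i: metric[i])

    # d4) contiguous window around the peak where the metric stays
    #     within `tolerance` of its peak value (assumed criterion)
    floor = metric[peak] - tolerance
    lo = hi = peak
    while lo > 0 and metric[lo - 1] >= floor:
        lo -= 1
    while hi < len(thresholds) - 1 and metric[hi + 1] >= floor:
        hi += 1

    # d5) reward value at each threshold
    rewards = {t: reward_at_threshold(scores, t) for t in thresholds}
    return thresholds[peak], (thresholds[lo], thresholds[hi]), rewards


if __name__ == "__main__":
    # Made-up confidence scores and reference labels, for illustration only.
    scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
    thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    reference, window, rewards = sweep_thresholds(scores, labels, thresholds)
    print("reference threshold:", reference)
    print("recommended window:", window)
    print("reward at reference:", rewards[reference])
```

Because the window is grown outward from the peak, the reference threshold always lies inside it, which matches the condition in step d4). Any other metric, window criterion, or reward definition could be substituted without changing the shape of the sweep.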