CPC G06N 20/00 (2019.01) [G06Q 40/10 (2013.01)] | 18 Claims |
1. A method for generating labeled training set data for a machine learning process, the method performed by one or more processors of a machine learning-based labeling system and comprising:
retrieving, using a machine learning analysis model, labeled data indicating labels entered by a user for a plurality of data items, the analysis model trained to generate a prediction of a training data label that a given user will enter for an unlabeled training data item based on training data items that the given user has already labeled;
identifying, using the trained analysis model, one or more characteristics of the labeled data, each respective characteristic of the identified characteristics predictive of a label that the user will enter for an unlabeled data item having the respective characteristic;
generating, for each respective unlabeled data item of a set of unlabeled data items, using the trained analysis model, a prediction of a label that the user will enter for the respective unlabeled data item and a confidence score indicative of a likelihood that the predicted label is correct;
selecting, based on the confidence scores, a subset of the set of unlabeled data items to be presented for labeling;
receiving one or more labels entered for the selected subset of unlabeled data items;
determining, based on the one or more labels entered for the subset of unlabeled data items, that a completion criterion associated with the trained analysis model is met; and
generating, using the trained analysis model, a label for one or more remaining unlabeled data items of the set of unlabeled data items.
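The steps recited in claim 1 can be read as an active-learning loop. The sketch below is one illustrative reading and not the implementation described in the patent. Everything named in it is a hypothetical stand-in: the `TokenVoteModel` class, the choice to present the lowest-confidence items first, the batch size, and the completion test (the model's predictions agree with the labels the user just entered). The claim itself does not fix any of these choices.

```python
from collections import Counter, defaultdict


class TokenVoteModel:
    """Toy stand-in for the claimed analysis model (hypothetical)."""

    def __init__(self):
        self.token_labels = defaultdict(Counter)

    def fit(self, labeled):
        # Count how often each token co-occurs with each user-entered label.
        self.token_labels.clear()
        for text, label in labeled:
            for token in set(text.lower().split()):
                self.token_labels[token][label] += 1

    def characteristics(self, min_count=2):
        # Tokens seen at least min_count times and only ever with one label
        # are treated as characteristics predictive of that label.
        return {
            tok: next(iter(counts))
            for tok, counts in self.token_labels.items()
            if len(counts) == 1 and sum(counts.values()) >= min_count
        }

    def predict(self, text):
        # Return (predicted label, confidence); confidence is the share of
        # token votes that went to the winning label.
        votes = Counter()
        for token in set(text.lower().split()):
            votes.update(self.token_labels.get(token, {}))
        if not votes:
            return None, 0.0
        label, top = votes.most_common(1)[0]
        return label, top / sum(votes.values())


def label_with_user(model, labeled, unlabeled, ask_user,
                    batch_size=2, required_agreement=1.0):
    """Assumes the unlabeled items are unique strings."""
    labeled = list(labeled)
    pending = list(unlabeled)
    while pending:
        model.fit(labeled)
        # Assumption: the least confident items are the ones shown to the user.
        batch = sorted(pending, key=lambda t: model.predict(t)[1])[:batch_size]
        predicted = {t: model.predict(t)[0] for t in batch}
        entered = {t: ask_user(t) for t in batch}
        labeled.extend(entered.items())
        pending = [t for t in pending if t not in entered]
        # Assumed completion criterion: predictions matched the entered labels.
        agreement = sum(predicted[t] == entered[t] for t in batch) / len(batch)
        if agreement >= required_agreement:
            break
    model.fit(labeled)
    # Label whatever is left without asking the user again.
    auto = {t: model.predict(t)[0] for t in pending}
    return labeled, auto


# Hypothetical data: transaction descriptions and expense categories.
seed = [
    ("uber ride downtown", "travel"),
    ("uber ride airport", "travel"),
    ("staples paper order", "office"),
    ("staples ink order", "office"),
]
oracle = {
    "uber airport trip": "travel",
    "staples toner order": "office",
    "lunch cafe": "meals",
    "uber ride home": "travel",
    "staples paper refill": "office",
}
model = TokenVoteModel()
user_labeled, auto_labeled = label_with_user(
    model, seed, list(oracle), ask_user=oracle.__getitem__
)
```

In this toy run the user is asked about four of the five unlabeled items over two rounds, and the model labels the remaining one on its own. A production system would replace the token-vote model with a trained classifier and would likely use a stricter, statistically grounded completion test.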