CPC G06N 20/20 (2019.01) [G06N 5/04 (2013.01)] | 18 Claims
1. A computer-implemented method of pruning features for training a machine learning model, the method performed by one or more processors of a pre-processing system and comprising:
receiving a dataset including a plurality of values for training a machine learning model, each of the plurality of values being associated with one of a plurality of features;
determining, for each of the plurality of features, one or more characteristics of the values associated with the feature;
identifying one or more less important features of the plurality of features based on the determined characteristics for each of the plurality of features, each less important feature including at least one of a constant feature, a quasi-constant feature, a duplicate feature, a correlated feature, or another feature deemed less important based on a numerical evaluation of its contribution to a predictive performance of the machine learning model;
generating, for each respective less important feature, a pruned dataset including a subset of values selected from the plurality of values, the selected subset of values excluding the values associated with the respective less important feature;
performing, for each pruned dataset generated, a mapping of the pruned dataset to one or more first predictions in accordance with a first machine learning algorithm;
determining, for each mapping performed, a performance level of the mapping based at least in part on one or more evaluation metrics;
selectively removing, from the dataset, the values associated with ones of the less important features based at least in part on a comparison of each determined performance level with a threshold performance level; and
generating a reduced dataset for training, using the first machine learning algorithm, the machine learning model, the reduced dataset including remaining values of the dataset after the values associated with the ones of the less important features are selectively removed.
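The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and everything in it beyond the claim language is an assumption made for illustration: the function names (`pearson`, `find_less_important`, `accuracy`, `prune`), the cut-offs `quasi_share` and `corr_limit`, the nearest-centroid classifier standing in for the "first machine learning algorithm", held-out accuracy as the single evaluation metric, and the rule that a flagged feature is removed only when the pruned dataset still performs at or above the threshold. The toy data is invented.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; 0.0 when either column has zero variance."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    return mean((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)


def find_less_important(columns, quasi_share=0.95, corr_limit=0.95):
    """Flag constant, quasi-constant, duplicate, and correlated features.

    The two cut-offs are assumed values, not taken from the claim.
    """
    flagged = {}
    names = list(columns)
    for name in names:
        values = columns[name]
        # Share of rows taken by the most frequent value of this feature.
        top_share = max(values.count(v) for v in set(values)) / len(values)
        if top_share == 1.0:
            flagged[name] = "constant"
        elif top_share >= quasi_share:
            flagged[name] = "quasi-constant"
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in flagged or b in flagged:
                continue
            if columns[a] == columns[b]:
                flagged[b] = "duplicate"
            elif abs(pearson(columns[a], columns[b])) >= corr_limit:
                flagged[b] = "correlated"
    return flagged


def accuracy(columns, labels, train_idx, test_idx):
    """Fit a nearest-centroid classifier on train rows, score on test rows."""
    names = list(columns)
    centroids = {}
    for label in set(labels[i] for i in train_idx):
        rows = [i for i in train_idx if labels[i] == label]
        centroids[label] = [mean(columns[n][i] for i in rows) for n in names]

    def sq_dist(point, centre):
        return sum((p - q) ** 2 for p, q in zip(point, centre))

    hits = 0
    for i in test_idx:
        point = [columns[n][i] for n in names]
        pred = min(centroids, key=lambda c: sq_dist(point, centroids[c]))
        hits += pred == labels[i]
    return hits / len(test_idx)


def prune(columns, labels, train_idx, test_idx, threshold):
    """Build one pruned dataset per flagged feature, score it, and drop the
    feature when the score stays at or above the threshold (assumed rule)."""
    flagged = find_less_important(columns)
    reduced = dict(columns)
    for name in flagged:
        pruned = {n: v for n, v in columns.items() if n != name}
        if accuracy(pruned, labels, train_idx, test_idx) >= threshold:
            del reduced[name]
    return reduced, flagged


# Invented toy data: one informative feature plus redundant variants of it.
x = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95, 0.12, 0.92]
columns = {
    "x": x,
    "x_copy": list(x),               # duplicate feature
    "const": [1] * 8,                # constant feature
    "x_scaled": [2 * v for v in x],  # perfectly correlated with x
    "noise": [5, 3, 8, 1, 7, 2, 6, 4],
}
labels = [0, 0, 0, 1, 1, 1, 0, 1]

reduced, flagged = prune(columns, labels, train_idx=[0, 1, 2, 3, 4, 5],
                         test_idx=[6, 7], threshold=0.9)
print(flagged)
print(sorted(reduced))
```

In this sketch each flagged feature is evaluated on its own pruned dataset, mirroring the "for each respective less important feature" language of the claim. A threshold set above any achievable score leaves the dataset unchanged.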