| CPC G06N 20/00 (2019.01) [G06N 5/04 (2013.01)] | 17 Claims |

|
1. A computer-implemented method comprising:
receiving, by one or more processors, a source training set comprising a plurality of initial data features, wherein an initial data feature of the plurality of initial data features is associated with a per-feature mutual information measure that defines a predictive capability of the initial data feature with respect to a target feature;
determining, by the one or more processors, a limited correlation subset of the plurality of initial data features from the source training set based at least in part on a per-feature-pair symmetric correlation measure and the per-feature mutual information measure, wherein:
(i) the source training set comprises a plurality of training data entries and each training data entry of the plurality of training data entries is associated with a plurality of categorical values respectively corresponding to the plurality of initial data features,
(ii) the per-feature-pair symmetric correlation measure identifies a correlation between a feature pair of one or more feature pairs from the plurality of initial data features and the target feature, and
(iii) a respective per-feature-pair symmetric correlation measure is determined for each feature pair of the one or more feature pairs;
extracting, by the one or more processors, one or more limited correlation features from the plurality of initial data features based at least in part on the limited correlation subset, wherein a limited correlation feature of the one or more limited correlation features corresponds to a respective initial data feature of a respective feature pair of the one or more feature pairs that is associated with a particular per-feature-pair symmetric correlation measure that is below a hyper-parameter, and the limited correlation subset is determined by:
(i) determining the respective feature pair of the one or more feature pairs based at least in part on the per-feature mutual information measure of the initial data feature,
(ii) determining that the per-feature-pair symmetric correlation measure for the respective feature pair satisfies the hyper-parameter, and
(iii) in response to determining that the per-feature-pair symmetric correlation measure for the respective feature pair satisfies the hyper-parameter, excluding the initial data feature from the limited correlation subset;
generating, by the one or more processors and based at least in part on the one or more limited correlation features, a limited correlation training set from the source training set by removing one or more categorical values (a) from each of the plurality of training data entries and (b) that respectively correspond to one or more of the plurality of initial data features that are excluded from the one or more limited correlation features;
storing, by the one or more processors, the limited correlation training set in memory; and
accessing, by the one or more processors, the memory to perform one or more training operations for training a categorical input machine learning model based at least in part on the limited correlation training set.
|
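Illustrative sketch (not part of the claim): the selection and filtering steps recited in claim 1 could be approximated as follows. The claim does not name a specific symmetric correlation measure or selection order, so this sketch makes several assumptions: it uses symmetric uncertainty as the per-feature-pair symmetric correlation measure, ranks features by mutual information with the target, and greedily drops any feature whose correlation with an already-kept feature reaches the threshold. The function names, the `threshold` parameter, and the dict-per-row data layout are all hypothetical.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (in nats) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two categorical sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def symmetric_uncertainty(xs, ys):
    """Assumed symmetric correlation measure: 2*I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 1.0  # both columns constant; treat them as fully redundant
    return 2.0 * mutual_information(xs, ys) / (hx + hy)


def select_limited_correlation_features(rows, feature_names, target_name, threshold):
    """Greedy selection: visit features in descending order of mutual
    information with the target, and keep a feature only if its symmetric
    uncertainty with every already-kept feature is below the threshold."""
    columns = {name: [row[name] for row in rows] for name in feature_names}
    target = [row[target_name] for row in rows]
    ranked = sorted(
        feature_names,
        key=lambda f: mutual_information(columns[f], target),
        reverse=True,
    )
    kept = []
    for f in ranked:
        if all(symmetric_uncertainty(columns[f], columns[k]) < threshold for k in kept):
            kept.append(f)
    return kept


def build_limited_correlation_training_set(rows, kept, target_name):
    """Remove, from every training entry, the values of excluded features."""
    keep = set(kept) | {target_name}
    return [{k: v for k, v in row.items() if k in keep} for row in rows]
```

In this sketch, a feature that duplicates a more informative feature (symmetric uncertainty of 1.0) is excluded, while a feature that is independent of the kept features is retained even if its mutual information with the target is low. A different symmetric measure, such as Cramér's V, could be substituted without changing the overall structure.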