US 12,190,219 B1
Systems and methods for outlier detection and feature transformation in machine learning model training
Joseph O. Nyangon, Hummelstown, PA (US); and Ruth Oluwadamilola Akintunde, Wake Forest, NC (US)
Assigned to SAS INSTITUTE INC., Cary, NC (US)
Filed by SAS INSTITUTE INC., Cary, NC (US)
Filed on Sep. 4, 2024, as Appl. No. 18/824,828.
Claims priority of provisional application 63/600,505, filed on Nov. 17, 2023.
Claims priority of provisional application 63/537,477, filed on Sep. 8, 2023.
Int. Cl. G06N 20/00 (2019.01)
CPC G06N 20/00 (2019.01)
29 Claims
 
1. A computer-program product comprising a non-transitory machine-readable storage medium storing computer instructions that, when executed by one or more processors, perform operations comprising:
obtaining a raw dataset comprising a plurality of data samples that store historical values of a target entity that includes an energy commodity, healthcare data management, retail inventory, an energy market, an energy consumption, energy utilities, electricity grid management, or energy;
executing an outlier filtration process based on obtaining the raw dataset, wherein the outlier filtration process includes:
  detecting, by a quantile-based outlier filtration algorithm, outlier data samples of the plurality of data samples that exceed a lower quantile threshold or an upper quantile threshold,
  generating an intermediate outlier-reduced dataset that includes a subset of the plurality of data samples, wherein the intermediate outlier-reduced dataset excludes the outlier data samples that exceed the lower quantile threshold or the upper quantile threshold,
  decomposing, by a matrix decomposition algorithm, the intermediate outlier-reduced dataset into a transformed features matrix and a sparse matrix, wherein the transformed features matrix includes a plurality of feature vectors of a plurality of principal components of the intermediate outlier-reduced dataset; and
  generating a refined outlier-reduced dataset that includes a subset of the plurality of feature vectors, wherein the refined outlier-reduced dataset excludes feature vectors of the transformed features matrix that are associated with an anomalous value in the sparse matrix;
training a model using the refined outlier-reduced dataset;
maintaining risk mitigation preparedness by predicting via the trained model a value of the target entity that includes predicting for a future time a demand of the energy commodity, the healthcare data management, the retail inventory, the energy market, the energy consumption, the energy utilities, the electricity grid management, or the energy; and
predicting, via the trained model, the value of the target entity at the future time.
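
The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is one possible reading under stated assumptions, not the patented implementation. The claim names neither a specific matrix decomposition algorithm nor a model type, so the sketch substitutes a truncated SVD whose residual is thresholded by a median-absolute-deviation rule (standing in for the sparse matrix) and an ordinary least-squares linear model. The function names, the 0.05/0.95 quantile thresholds, the cutoff multiplier `k`, the rank, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def quantile_filter(X, lower_q=0.05, upper_q=0.95):
    """Boolean mask of rows whose every column lies inside the quantile bounds."""
    lo = np.quantile(X, lower_q, axis=0)
    hi = np.quantile(X, upper_q, axis=0)
    return np.all((X >= lo) & (X <= hi), axis=1)


def split_low_rank_sparse(X, rank, k=6.0):
    """Assumed stand-in decomposition: principal-component scores from a
    truncated SVD, plus a sparse matrix holding only the large residuals."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :rank] * s[:rank]          # transformed features matrix
    residual = Xc - scores @ Vt[:rank]
    # Robust scale of the residual via the median absolute deviation (MAD).
    mad = np.median(np.abs(residual - np.median(residual)))
    cutoff = k * 1.4826 * mad
    sparse = np.where(np.abs(residual) > cutoff, residual, 0.0)
    return scores, sparse


def refine(X, rank, lower_q=0.05, upper_q=0.95, k=6.0):
    """Run both filtration stages; return refined features and row indices."""
    stage1 = np.flatnonzero(quantile_filter(X, lower_q, upper_q))
    scores, sparse = split_low_rank_sparse(X[stage1], rank, k)
    clean = ~np.any(sparse != 0.0, axis=1)   # rows with no anomalous entry
    return scores[clean], stage1[clean], stage1


def fit_linear(features, y):
    """Least-squares linear model with an intercept (assumed model type)."""
    A = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, features):
    return np.column_stack([features, np.ones(len(features))]) @ coef


# Synthetic rank-2 data (hypothetical): five observed columns, two latent drivers.
rng = np.random.default_rng(0)
T = rng.uniform(-1.0, 1.0, size=(200, 2))
W = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])
X = T @ W + 0.01 * rng.normal(size=(200, 5))
y = T @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)

X[0] = 40.0                              # gross outlier, far outside the quantiles
X[1] = [0.3, -0.3, 0.0, 0.3, -0.3]       # in range per column, breaks the correlations

features, kept, stage1 = refine(X, rank=2)
coef = fit_linear(features, y[kept])
predictions = predict_linear(coef, features)
```

In this sketch, row 0 is dropped by the quantile stage, while row 1 passes the quantile stage and is dropped only after the decomposition, because its residual lands in the sparse matrix. That difference is why the claim recites two filtration stages. Forecasting a value at a future time would also require projecting new samples with the stored column means and components, which is omitted here.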