US 12,093,822 B2
Anomaly detection based preprocessing for better classification tasks with noisy and imbalanced datasets
Arno Schneuwly, Effretikon (CH); and Suwen Yang, Belmont, CA (US)
Assigned to Oracle International Corporation, Redwood Shores, CA (US)
Filed by Oracle International Corporation, Redwood Shores, CA (US)
Filed on Oct. 28, 2022, as Appl. No. 17/976,473.
Prior Publication US 2024/0143993 A1, May 2, 2024
Int. Cl. G06F 11/00 (2006.01); G06F 11/07 (2006.01); G06N 3/04 (2023.01); G06N 3/08 (2023.01)
CPC G06N 3/08 (2013.01) [G06F 11/0727 (2013.01); G06F 11/079 (2013.01); G06N 3/04 (2013.01)]
20 Claims
 
1. A method comprising:
training, based on a plurality of timeseries, a plurality of anomaly detectors, wherein:
each anomaly detector in the plurality of anomaly detectors is configured with a respective distinct contamination factor,
each timeseries in the plurality of timeseries comprises a temporal sequence of datapoints that characterize a device, and
each datapoint in the plurality of timeseries comprises a respective label that indicates whether the device failed when the datapoint occurred;
detecting, by each anomaly detector of the plurality of anomaly detectors after said training:
a plurality of anomalous datapoints in the plurality of timeseries, wherein a size of the plurality of anomalous datapoints is proportional to said contamination factor of the anomaly detector,
a respective healthy count of the plurality of anomalous datapoints in timeseries not containing a datapoint whose label indicates the device failed, and
a respective unhealthy count of the plurality of anomalous datapoints in timeseries containing a datapoint whose label indicates the device failed;
detecting, for a particular anomaly detector of the plurality of anomaly detectors, that a magnitude of difference between the respective healthy count and the respective unhealthy count is less than a threshold;
oversampling, based on said contamination factor of the particular anomaly detector, an oversampled plurality of anomalous datapoints from the anomalous datapoints of the plurality of anomaly detectors; and
training, based on the oversampled plurality of anomalous datapoints, a classifier;
wherein the method is performed by one or more computers.
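
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several pieces are assumptions made only for illustration: the detector is a toy z-score scorer standing in for whatever anomaly detector is actually used, each datapoint is reduced to a single (value, failed) pair, the "particular" detector is taken to be the first one, in ascending contamination order, whose healthy/unhealthy difference falls below the threshold, and the oversampling pool is drawn from detectors whose contamination factor does not exceed the selected one. The names train_detector, detect, select_detector, oversample, train_classifier and TIMESERIES are hypothetical.

```python
import random
import statistics

def train_detector(timeseries, contamination):
    # Toy stand-in for a real anomaly detector (assumption): learn mean/stdev
    # over all datapoints, then set the score cutoff so that roughly
    # `contamination` of the training datapoints are flagged.
    values = [v for series in timeseries for v, _ in series]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values) or 1.0
    scores = sorted((abs(v - mean) / stdev for v in values), reverse=True)
    n_flag = max(1, round(contamination * len(values)))
    return {"contamination": contamination, "mean": mean, "stdev": stdev,
            "cutoff": scores[n_flag - 1]}

def detect(det, timeseries):
    # Return the anomalous datapoints plus the healthy/unhealthy counts:
    # anomalies in series with no failure label vs. series with one.
    anomalies, healthy, unhealthy = [], 0, 0
    for series in timeseries:
        series_failed = any(failed for _, failed in series)
        for value, failed in series:
            if abs(value - det["mean"]) / det["stdev"] >= det["cutoff"]:
                anomalies.append((value, failed))
                if series_failed:
                    unhealthy += 1
                else:
                    healthy += 1
    return anomalies, healthy, unhealthy

def select_detector(results, threshold):
    # Assumed selection rule: first detector (ascending contamination)
    # with |healthy - unhealthy| below the threshold.
    for det, _, healthy, unhealthy in results:
        if abs(healthy - unhealthy) < threshold:
            return det
    return None

def oversample(results, selected_contamination, size, rng):
    # Assumed pooling rule: sample with replacement from anomalies found by
    # detectors whose contamination factor is at most the selected one.
    pool = [p for det, anomalies, _, _ in results
            if det["contamination"] <= selected_contamination
            for p in anomalies]
    return rng.choices(pool, k=size)

def train_classifier(samples):
    # Toy classifier: decision boundary halfway between the class means.
    pos = [v for v, failed in samples if failed]
    neg = [v for v, failed in samples if not failed]
    if not pos or not neg:
        raise ValueError("oversampled set must contain both classes")
    boundary = (statistics.fmean(pos) + statistics.fmean(neg)) / 2
    return lambda value: value > boundary

# Hypothetical data: each series is a list of (value, failed) datapoints.
TIMESERIES = [
    [(1.0, False), (1.2, False), (0.8, False), (6.0, False)],  # healthy, noisy
    [(1.0, False), (1.1, False), (0.9, False), (1.0, False)],  # healthy
    [(1.0, False), (1.1, False), (7.0, True), (9.0, True)],    # device failed
    [(0.9, False), (1.0, False), (6.5, False), (8.0, True)],   # device failed
]

results = []
for contamination in (0.25, 0.5, 0.75):  # one detector per distinct factor
    det = train_detector(TIMESERIES, contamination)
    results.append((det, *detect(det, TIMESERIES)))

selected = select_detector(results, threshold=3)
sample = oversample(results, selected["contamination"], size=40,
                    rng=random.Random(0))
predict_failure = train_classifier(sample)
```

In this sketch the three detectors yield (healthy, unhealthy) counts of (0, 4), (3, 5) and (6, 7), so with a threshold of 3 the detector with contamination factor 0.5 is selected. The threshold, the oversampled size of 40 and the random seed are arbitrary choices for the example, not values taken from the patent.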