US 12,456,085 B2
Anomaly detection of miscoded tags in data fields
Andy Leung, Richmond (CA); Mayur Pandya, Oakland, CA (US); Jon Nelson, Denville, CA (US); Dalmo Cirne, Longmont, CO (US); and Doron Zehavi, Venloe, CA (US)
Assigned to WORKDAY, INC., Pleasanton, CA (US)
Filed by WORKDAY, INC., Pleasanton, CA (US)
Filed on Mar. 18, 2022, as Appl. No. 17/698,458.
Prior Publication US 2023/0297916 A1, Sep. 21, 2023
Int. Cl. G06Q 10/00 (2023.01); G06Q 10/0635 (2023.01); G06Q 30/00 (2023.01); G06Q 30/018 (2023.01)
CPC G06Q 10/0635 (2013.01) [G06Q 30/0185 (2013.01)]
20 Claims
 
1. A method comprising:
receiving, by a processor, a data record having a plurality of fields;
training, by the processor, an unsupervised machine learning model on historical data records by iteratively selecting different subsets of fields from the historical data records using a Bayesian Optimization and Hyperband (BOHB) algorithm to optimize model performance, and training the unsupervised machine learning model using the selected subset of fields that results in optimal model performance, wherein the iterative selection continues until identifying a subset of fields that results in a Bayesian network model having the best association with user-defined fields;
generating, by the processor, a risk score for the data record using the trained unsupervised machine learning model;
determining, by the processor, a threshold value based on analyzing a distribution of risk scores from the historical data records generated using the unsupervised machine learning model and selecting a specific quantile value from the distribution;
determining, by the processor, that the data record is a potential anomaly based on the risk score exceeding the threshold value;
identifying, by the processor, an anomalous field from the plurality of fields of the data record by iteratively removing each field in the plurality of fields to generate candidate data records, iteratively scoring each of the candidate data records using the unsupervised machine learning model to generate new risk scores for each iteration, selecting a candidate data record ranked with a lowest risk score, and identifying a field removed from the candidate data record as the anomalous field;
generating, by the processor, a plurality of permutations of the data record, the plurality of permutations generated by changing a value of the anomalous field based on a set of most frequently occurring values for the anomalous field in the historical data records, a number of the plurality of permutations dynamically determined based on available computational resources; and
outputting, by the processor, a replacement record selected from the plurality of permutations, the replacement record having a field value for the anomalous field that generates a lowest risk score among the plurality of permutations, the lowest risk score selected from a plurality of risk scores generated by inputting the plurality of permutations into the unsupervised machine learning model when scored by iteratively inputting the plurality of permutations into the unsupervised machine learning model.
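The scoring, thresholding, field-identification, and replacement steps recited above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the claim: the unsupervised model is replaced by a toy per-field frequency scorer (not a Bayesian network), the BOHB field-selection step is omitted, the number of permutations is a plain parameter `k` rather than being derived from available computational resources, and all function names, field names, and sample values are hypothetical.

```python
import math
from collections import Counter


def fit_frequency_model(records):
    """Toy stand-in for the trained model: per-field value counts."""
    counts = {}
    for record in records:
        for field, value in record.items():
            counts.setdefault(field, Counter())[value] += 1
    return counts, len(records)


def risk_score(model, record):
    """Sum of negative log smoothed frequencies over the fields present."""
    counts, n = model
    total = 0.0
    for field, value in record.items():
        seen = counts[field]
        # Add-one smoothing so unseen values get a small nonzero probability.
        p = (seen[value] + 1) / (n + len(seen) + 1)
        total += -math.log(p)
    return total


def quantile_threshold(scores, q):
    """Pick the score at quantile q of the historical score distribution."""
    ordered = sorted(scores)
    return ordered[int(q * (len(ordered) - 1))]


def find_anomalous_field(model, record):
    """Remove each field in turn; the removal giving the lowest score wins."""
    candidates = {
        field: {k: v for k, v in record.items() if k != field}
        for field in record
    }
    return min(candidates, key=lambda f: risk_score(model, candidates[f]))


def best_replacement(model, record, field, k):
    """Try the k most frequent historical values; keep the lowest-risk record."""
    counts, _ = model
    top_values = [value for value, _ in counts[field].most_common(k)]
    permutations = [{**record, field: value} for value in top_values]
    return min(permutations, key=lambda r: risk_score(model, r))


# Hypothetical historical data and an incoming record with a miscoded tag.
historical = (
    [{"dept": "sales", "account": "travel", "region": "west"}] * 50
    + [{"dept": "eng", "account": "hardware", "region": "east"}] * 30
)
model = fit_frequency_model(historical)
threshold = quantile_threshold([risk_score(model, r) for r in historical], 0.95)

record = {"dept": "sales", "account": "zzz", "region": "west"}
score = risk_score(model, record)
if score > threshold:
    field = find_anomalous_field(model, record)
    replacement = best_replacement(model, record, field, k=2)
```

Because the toy scorer treats fields independently, the replacement it selects is simply the most frequent historical value for the anomalous field. A model that captures dependencies between fields, as the claim's Bayesian network does, could prefer a less frequent value that better fits the rest of the record.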