US 12,306,907 B2
System and method for automatic data consistency checking using automatically defined rules
J. Mitchell Haile, Carlisle, MA (US)
Assigned to Data Culpa, Inc., Carlisle, MA (US)
Filed by Data Culpa, Inc., Carlisle, MA (US)
Filed on Apr. 18, 2022, as Appl. No. 17/722,500.
Claims priority of provisional application 63/178,711, filed on Apr. 23, 2021.
Prior Publication US 2022/0343109 A1, Oct. 27, 2022
Int. Cl. G06K 9/00 (2022.01); G06F 18/214 (2023.01); G06F 18/22 (2023.01); G06N 7/01 (2023.01); G06N 20/00 (2019.01)
CPC G06F 18/22 (2023.01) [G06F 18/214 (2023.01); G06N 7/01 (2023.01); G06N 20/00 (2019.01)]
17 Claims
1. A data pipeline monitoring system configured to monitor operations of a data pipeline, the data pipeline monitoring system comprising:
data processing circuitry configured to receive a training data set and process the training data set;
the data processing circuitry configured to identify a data type, data format, and data value range of the training data set based on the processing;
the data processing circuitry configured to determine an average throughput and entropy for the data pipeline;
the data processing circuitry configured to receive data configuration rules that indicate a preferred data format;
the data processing circuitry configured to generate a data standard that indicates at least the preferred data format based on the data type, data format, and data value range of the training data set, the average throughput and entropy for the data pipeline, and the data configuration rules that indicate the preferred data format;
the data processing circuitry configured to receive an output data set from the data pipeline wherein the data pipeline receives an input data set, processes the input data set, responsively generates the output data set, and transfers the output data set to the data processing circuitry; and
the data processing circuitry configured to determine similarities between the output data set and the data standard, score the output data set based on the similarity between the output data set and the data standard, and report the score for the output data set.
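
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (shannon_entropy, closeness, build_standard, score_output) is hypothetical, as are the choices to treat records as decimal strings, to express the preferred format as a regular expression, to measure entropy as Shannon entropy over record values, and to average three components into one score. The claim does not specify any of these details.

```python
import math
import re
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def closeness(a, b):
    """Similarity of two non-negative numbers in [0, 1]; 1.0 means equal."""
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b))


def build_standard(training, preferred_format, avg_throughput):
    """Derive a hypothetical 'data standard' from a training data set.

    Assumes every training record is a string that parses as a float.
    preferred_format stands in for the received data configuration rules.
    """
    numbers = [float(v) for v in training]
    return {
        "type": type(training[0]),                  # identified data type
        "format": re.compile(preferred_format),     # preferred data format
        "range": (min(numbers), max(numbers)),      # identified value range
        "entropy": shannon_entropy(training),       # entropy of the data
        "throughput": avg_throughput,               # records per second
    }


def score_output(output, throughput, standard):
    """Score an output data set against the standard; 1.0 is a full match."""
    lo, hi = standard["range"]
    conforming = sum(
        1
        for v in output
        if isinstance(v, standard["type"])
        and standard["format"].fullmatch(v) is not None
        and lo <= float(v) <= hi
    )
    parts = [
        conforming / len(output),
        closeness(shannon_entropy(output), standard["entropy"]),
        closeness(throughput, standard["throughput"]),
    ]
    # Equal weighting is an assumption made only for this sketch.
    return sum(parts) / len(parts)


standard = build_standard(
    ["10.00", "12.50", "11.25", "10.00"], r"\d+\.\d{2}", avg_throughput=200.0
)
good = score_output(["10.00", "12.50", "11.25", "10.00"], 200.0, standard)
bad = score_output(["10", "99.99", "11.25", None], 50.0, standard)
print(good, bad)
```

In this sketch, an output data set that matches the training data in type, format, range, entropy, and throughput scores 1.0, while malformed, out-of-range, or missing records and a drop in throughput each pull the score toward 0.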