US 12,450,526 B2
Systems and methods for predicting correct or missing data and data anomalies
Kirk J. Haslbeck, Woodbine, MD (US); and Brian N. Mearns, Falls Church, VA (US)
Assigned to Collibra Belgium BV, Brussels (BE)
Filed by Collibra Belgium BV, Brussels (BE)
Filed on Jun. 11, 2024, as Appl. No. 18/740,036.
Application 18/740,036 is a continuation of application No. 18/160,179, filed on Jan. 26, 2023, granted, now 12,008,453, issued on Jun. 11, 2024.
Application 18/740,036 is a continuation of application No. 17/236,823, filed on Apr. 21, 2021, granted, now 11,568,328, issued on Jan. 31, 2023.
Prior Publication US 2024/0420033 A1, Dec. 19, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 20/20 (2019.01); G06F 16/215 (2019.01)

CPC G06N 20/20 (2019.01) [G06F 16/215 (2019.01)]

20 Claims

1. A method for improving data quality in a first dataset, comprising:

receiving the first dataset;

applying a machine-learning algorithm to the first dataset, wherein the machine-learning algorithm identifies at least one relationship between a first data column and a second data column in the first dataset;

based on the identification of the at least one relationship between the first data column and the second data column in the first dataset, generating a second dataset, wherein the second dataset is a subset of the first dataset;

concatenating a plurality of column headers in the second dataset to obtain an itemset;

computing a probability matrix of itemset combinations;

based on the probability matrix of itemset combinations, identifying at least one frequency value associated with the at least one relationship between the first data column and the second data column in the first dataset; and

improving the data quality in the first dataset by identifying at least one anomaly in the first dataset based on the at least one frequency value associated with the at least one relationship.
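
The steps of claim 1 can be illustrated with a minimal Python sketch. It is one possible reading of the claim language, not the patented implementation. Everything named here is a hypothetical chosen for illustration: the functions related_pairs, probability_matrix, and find_anomalies, the thresholds min_strength and max_frequency, and the sample data. A simple co-occurrence heuristic stands in for the claimed machine-learning algorithm, the "second dataset" is taken to be the two related columns only, and the "itemset" is taken to be the related column headers joined with "+".

```python
from collections import Counter
from itertools import combinations


def related_pairs(rows, columns, min_strength=0.8):
    """Heuristic stand-in (an assumption) for the claimed machine-learning step.

    Returns pairs (a, b) where the value in column a predicts the value
    in column b for at least min_strength of the rows.
    """
    if not rows:
        return []
    pairs = []
    for a, b in combinations(columns, 2):
        if len({r[a] for r in rows}) * 2 > len(rows):
            continue  # column a rarely repeats, so no pattern can be learned
        joint = Counter((r[a], r[b]) for r in rows)
        best = Counter()  # per value of a: count of its most common partner
        for (va, _vb), n in joint.items():
            best[va] = max(best[va], n)
        if sum(best.values()) / len(rows) >= min_strength:
            pairs.append((a, b))
    return pairs


def probability_matrix(rows, a, b):
    """Relative frequency of each b value given each a value."""
    joint = Counter((r[a], r[b]) for r in rows)
    totals = Counter(r[a] for r in rows)
    matrix = {}
    for (va, vb), n in joint.items():
        matrix.setdefault(va, {})[vb] = n / totals[va]
    return matrix


def find_anomalies(rows, columns, max_frequency=0.25):
    """Flag rows whose value combination is rare for a related column pair."""
    findings = []
    for a, b in related_pairs(rows, columns):
        subset = [{a: r[a], b: r[b]} for r in rows]  # the "second dataset"
        itemset = "+".join((a, b))                   # concatenated headers
        matrix = probability_matrix(subset, a, b)
        for index, r in enumerate(subset):
            frequency = matrix[r[a]][r[b]]           # the "frequency value"
            if frequency <= max_frequency:
                findings.append((index, itemset, frequency))
    return findings


# Made-up sample data: the last row pairs Austin with MA instead of TX.
rows = (
    [{"city": "Austin", "state": "TX"} for _ in range(5)]
    + [{"city": "Boston", "state": "MA"} for _ in range(4)]
    + [{"city": "Austin", "state": "MA"}]
)
for i, r in enumerate(rows):
    r["amount"] = 10 * (i + 1)
columns = ["city", "state", "amount"]

print([index for index, _, _ in find_anomalies(rows, columns)])  # → [9]
```

In this sample, "Austin" appears with "TX" in five of six rows and with "MA" in one, so the Austin/MA row receives a frequency value of 1/6 and is flagged. The "amount" column never repeats and forms no related pair, so it is excluded from the second dataset.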