US 11,886,385 B2
Scalable identification of duplicate datasets in heterogeneous datasets
Praduemn K. Goyal, Holmdel, NJ (US); Sandeep Hans, New Delhi (IN); Samiulla Zakir Hussain Shaikh, Bangalore (IN); and Diptikalyan Saha, Bangalore (IN)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed on Jun. 2, 2022, as Appl. No. 17/805,134.
Prior Publication US 2023/0394011 A1, Dec. 7, 2023
Int. Cl. G06F 16/174 (2019.01); G06F 16/14 (2019.01); G06F 16/16 (2019.01); G06F 18/22 (2023.01)
CPC G06F 16/1748 (2019.01) [G06F 16/148 (2019.01); G06F 16/162 (2019.01); G06F 18/22 (2023.01)]
20 Claims
1. A computer-based method of identifying and sorting similar datasets stored within a pool of heterogeneous datasets, the method comprising:
receiving a plurality of heterogeneous datasets;
automatically comparing schema information and metadata within each of the received plurality of heterogeneous datasets to generate name-based similarity scores for each combination of datasets;
automatically pruning the plurality of heterogeneous datasets by removing datasets having no similarities;
automatically identifying clusters of similar datasets using the name-based similarity scores for each dataset and generating mapping graphs illustrating each cluster of similar datasets;
automatically comparing data distribution information within each of the received plurality of heterogeneous datasets to generate a plurality of data distribution similarity scores for each heterogeneous dataset;
automatically calculating an overall distance metric using the name-based similarity scores and plurality of data distribution similarity scores; and
based on the calculated overall distance metric, automatically generating distance graphs, wherein the automatically generated distance graphs identify clusters of similar datasets and illustrate inferred lineage for the clusters of similar datasets.
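
The steps recited above can be illustrated with a minimal sketch. It is not the claimed implementation: the claim does not specify scoring functions, thresholds, or weights, so everything below is an assumption chosen for illustration. Each dataset is modeled as a list of row dictionaries; the name-based score is a Jaccard overlap of column names; the data distribution score is a per-column Jaccard overlap of distinct values; pruning drops any dataset with no pair at or above a hypothetical `threshold`; and the overall distance is one minus a weighted average of the two scores (`name_weight` is likewise hypothetical). Lineage inference is not attempted here.

```python
from itertools import combinations


def name_similarity(cols_a, cols_b):
    """Jaccard overlap of column names (an assumed name-based score)."""
    union = cols_a | cols_b
    return len(cols_a & cols_b) / len(union) if union else 0.0


def distribution_similarity(rows_a, rows_b, shared_cols):
    """Mean per-column Jaccard overlap of distinct values (an assumed score)."""
    if not shared_cols:
        return 0.0
    total = 0.0
    for col in shared_cols:
        values_a = {row.get(col) for row in rows_a}
        values_b = {row.get(col) for row in rows_b}
        union = values_a | values_b
        total += len(values_a & values_b) / len(union) if union else 0.0
    return total / len(shared_cols)


def find_clusters(names, edges):
    """Group names into connected components over the given edges (union-find)."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


def duplicate_graph(datasets, threshold=0.5, name_weight=0.5):
    """Return (clusters, distances) for a dict of dataset name -> list of row dicts."""
    # Schema is approximated by the keys of the first row (a simplification).
    columns = {n: set(rows[0]) if rows else set() for n, rows in datasets.items()}

    # Name-based similarity score for each combination of datasets.
    name_scores = {
        (a, b): name_similarity(columns[a], columns[b])
        for a, b in combinations(sorted(datasets), 2)
    }

    # Prune: keep only pairs, and therefore datasets, that reach the threshold.
    similar = {pair: s for pair, s in name_scores.items() if s >= threshold}
    kept = sorted({name for pair in similar for name in pair})

    # Clusters of similar datasets from the name-based scores.
    clusters = find_clusters(kept, similar.keys())

    # Overall distance: 1 - weighted average of name and distribution scores.
    distances = {}
    for (a, b), name_score in similar.items():
        dist_score = distribution_similarity(
            datasets[a], datasets[b], columns[a] & columns[b]
        )
        combined = name_weight * name_score + (1 - name_weight) * dist_score
        distances[(a, b)] = 1.0 - combined
    return clusters, distances
```

In this sketch the returned `distances` dictionary plays the role of a distance graph: keys are edges between retained datasets and values are edge weights, with smaller values indicating likelier duplicates. A production system would presumably replace the distinct-value overlap with richer distribution statistics and avoid the all-pairs comparison for large pools.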