US 12,271,430 B2
Data cataloging based on classification models
Leon Burda, Cupertino, CA (US); Lingling Yan, Morgan Hill, CA (US); and Shayak Sadhu, Hooghly (IN)
Assigned to HITACHI VANTARA LLC, Santa Clara, CA (US)
Appl. No. 18/036,468
Filed by HITACHI VANTARA LLC, Santa Clara, CA (US)
PCT Filed Nov. 17, 2020, PCT No. PCT/US2020/060834
§ 371(c)(1), (2) Date May 11, 2023,
PCT Pub. No. WO2022/108576, PCT Pub. Date May 27, 2022.
Prior Publication US 2024/0012859 A1, Jan. 11, 2024
Int. Cl. G06F 16/906 (2019.01); G06F 16/907 (2019.01)
CPC G06F 16/906 (2019.01) [G06F 16/907 (2019.01)]
20 Claims
 
1. A system comprising:
one or more processors configured by executable instructions to perform operations comprising:
receiving a first data set including structured or semi-structured data;
receiving a user input to create a classification to use for the first data set;
receiving a user input to associate, in a metadata glossary, the classification with the first data set as reference data;
generating a first fingerprint for a first classification model based on a plurality of data properties of the first data set, the plurality of data properties of the first data set being determined based on at least one of top K most frequent values, top K most frequent patterns, top K most frequent tokens, length distribution, minimum and/or maximum values, quantiles, cardinality, row counts, null counts, numeric counts, T-digest quantiles for numeric data, or hyperloglog for cardinality estimation;
determining a classification association with the first data set based in part on comparing the first fingerprint of the first classification model with a second fingerprint associated with a second classification model of a second data set, the second fingerprint being determined based at least in part on a plurality of data properties of the second data set;
determining a user curation result with respect to the classification association with the first data set; and
updating the first classification model for the first data set based at least in part on the user curation result.
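
The fingerprint generation and comparison steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`value_pattern`, `top_k_share`, `fingerprint`, `overlap`, `similarity`), the pattern encoding (digits as `9`, letters as `A`), and the unweighted-mean similarity score are all assumptions made for illustration. Only a few of the listed data properties are covered (row counts, null counts, cardinality, top K values, top K patterns, length distribution), and cardinality is counted exactly rather than estimated with HyperLogLog.

```python
from collections import Counter
import re


def value_pattern(value: str) -> str:
    """Map a value to a coarse shape: digits -> 9, letters -> A, other characters kept."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def top_k_share(items, k):
    """Return the k most frequent items with their relative frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {item: n / total for item, n in counts.most_common(k)}


def fingerprint(column, k=3):
    """Summarize a column by a few of the data properties listed in the claim."""
    non_null = [str(v) for v in column if v is not None]
    return {
        "row_count": len(column),
        "null_count": len(column) - len(non_null),
        # Exact distinct count; a HyperLogLog sketch would estimate this instead.
        "cardinality": len(set(non_null)),
        "top_values": top_k_share(non_null, k),
        "top_patterns": top_k_share([value_pattern(v) for v in non_null], k),
        "lengths": top_k_share([len(v) for v in non_null], k),
    }


def overlap(a, b):
    """Shared relative frequency of two frequency maps, in [0, 1]."""
    return sum(min(a[key], b[key]) for key in a.keys() & b.keys())


def similarity(fp1, fp2):
    """Hypothetical score: unweighted mean of pattern, length, and value overlap."""
    keys = ("top_patterns", "lengths", "top_values")
    return sum(overlap(fp1[key], fp2[key]) for key in keys) / len(keys)


# Illustrative data only: a reference column of ZIP-like codes and two candidates.
reference = fingerprint(["94105", "10001", "60614", None])
zip_like = fingerprint(["30301", "73301", "02139"])
email_like = fingerprint(["alice@example.com", "bob@example.org"])

print(similarity(reference, zip_like))    # patterns and lengths match, values differ
print(similarity(reference, email_like))  # nothing in common
```

In this sketch a candidate column whose score exceeds some chosen threshold would be proposed for the reference classification, and the user's acceptance or rejection of that proposal (the curation result) could then be used to adjust the threshold or the stored fingerprint. The threshold and the adjustment rule are left out because the claim does not specify them.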