| CPC G06F 16/906 (2019.01) [G06F 16/907 (2019.01)] | 20 Claims |

|
1. A system comprising:
one or more processors configured by executable instructions to perform operations comprising:
receiving a first data set including structured or semi-structured data;
receiving a user input to create a classification to use for the first data set;
receiving a user input to associate, in a metadata glossary, the classification with the first data set as reference data;
generating a first fingerprint for a first classification model based on a plurality of data properties of the first data set, the plurality of data properties of the first data set being determined based on at least one of top K most frequent values, top K most frequent patterns, top K most frequent tokens, length distribution, minimum and/or maximum values, quantiles, cardinality, row counts, null counts, numeric counts, T-digest quantiles for numeric data, or HyperLogLog for cardinality estimation;
determining a classification association with the first data set based in part on comparing the first fingerprint of the first classification model with a second fingerprint associated with a second classification model of a second data set, the second fingerprint being determined based at least in part on a plurality of data properties of the second data set;
determining a user curation result with respect to the classification association with the first data set; and
updating the first classification model for the first data set based at least in part on the user curation result.
|
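
The fingerprint generation and comparison steps recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the function names (`pattern_of`, `fingerprint`, `jaccard`, `similarity`), the choice of properties, and the Jaccard-based score are all assumptions made for illustration. Cardinality is computed exactly here, where the claim also lists HyperLogLog estimation, and quantiles (for example via T-digest) are omitted.

```python
import re
from collections import Counter


def pattern_of(value: str) -> str:
    """Reduce a value to a coarse shape: digits become 9, ASCII letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def fingerprint(values, k=3):
    """Summarise one column with a few of the data properties listed in the claim."""
    non_null = [str(v) for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        # Exact distinct count; a sketch such as HyperLogLog would estimate this.
        "cardinality": len(set(non_null)),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "top_values": [v for v, _ in Counter(non_null).most_common(k)],
        "top_patterns": [
            p for p, _ in Counter(map(pattern_of, non_null)).most_common(k)
        ],
    }


def jaccard(a, b) -> float:
    """Set overlap in [0, 1]; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def similarity(fp1, fp2) -> float:
    """Hypothetical score: mean Jaccard overlap of top values and top patterns."""
    return 0.5 * jaccard(fp1["top_values"], fp2["top_values"]) + 0.5 * jaccard(
        fp1["top_patterns"], fp2["top_patterns"]
    )


# Reference data tagged by a user with a classification such as "country code".
reference = fingerprint(["US", "DE", "FR", "US", None])
candidate = fingerprint(["US", "FR", "JP", "FR"])
unrelated = fingerprint(["2021-01-01", "2022-05-17"])

print(similarity(reference, candidate))  # 0.75
print(similarity(reference, unrelated))  # 0.0
```

A score above some chosen threshold could be surfaced as a suggested classification association, and the user's acceptance or rejection of that suggestion would then feed back into the model, in the spirit of the curation step of the claim. The threshold and the feedback mechanism are left out of this sketch.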