CPC G06F 16/2428 (2019.01) [G06F 16/287 (2019.01); G06F 40/284 (2020.01)] | 28 Claims |
1. A computer-program product embodied in a non-transitory machine-readable storage medium storing computer instructions that, when executed by one or more processors, perform operations comprising:
receiving an input, via a computer, of a pre-constructed target hierarchical taxonomy comprising a plurality of distinct hierarchical taxonomy categories organized according to relationships between the plurality of distinct hierarchical taxonomy categories;
identifying distinct hierarchical categories within the plurality of distinct hierarchical taxonomy categories of the pre-constructed target hierarchical taxonomy, wherein a distinct hierarchical category includes a hypernym and one or more hyponyms;
extracting, by the one or more processors, a plurality of distinct taxonomy tokens from the plurality of distinct hierarchical taxonomy categories of the pre-constructed target hierarchical taxonomy;
computing, by a token vectorization model, a taxonomy vector corpus based on the plurality of distinct taxonomy tokens, the taxonomy vector corpus comprising a distinct taxonomy embedding for each distinct taxonomy token of the plurality of distinct hierarchical taxonomy tokens of the pre-constructed target hierarchical taxonomy;
computing, by a machine learning model, a plurality of distinct taxonomy clusters based on an input of the taxonomy vector corpus, the machine learning model comprising an unsupervised machine learning model;
constructing a hierarchical taxonomy data table classification model based on the plurality of distinct taxonomy clusters, wherein the hierarchical taxonomy data table classification model is configured to classify unlabeled tabular datasets to one of the plurality of distinct hierarchical taxonomy categories of the pre-constructed target hierarchical taxonomy, wherein a given tabular dataset of the unlabeled tabular datasets includes at least a plurality of columns or a plurality of rows;
converting a volume of unlabeled tabular datasets to a plurality of distinct corpora of taxonomy-labeled tabular datasets based on using the hierarchical taxonomy data table classification model; and
outputting, via a graphical user interface, at least one corpus of taxonomy-labeled tabular datasets of the plurality of distinct corpora of taxonomy-labeled tabular datasets based on an input of a tabular data classification query.
|
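The operations recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation: the claim does not name a particular token vectorization model or clustering algorithm, so the sketch assumes hashed character trigrams as a stand-in embedding and plain k-means as the unsupervised model. The toy taxonomy, the table names, and every function name (`extract_tokens`, `embed`, `kmeans`, `build_classifier`, `label_tables`) are hypothetical and chosen only for illustration.

```python
import math
import zlib
from collections import Counter

# Hypothetical toy taxonomy: each category is a hypernym with its hyponyms.
TAXONOMY = {
    "finance": ["invoice", "payment", "revenue"],
    "healthcare": ["patient", "diagnosis", "prescription"],
}
DIM = 32  # embedding width, an arbitrary choice for this sketch


def extract_tokens(taxonomy):
    """Map each distinct taxonomy token to the category it came from."""
    token_to_category = {}
    for hypernym, hyponyms in taxonomy.items():
        for token in [hypernym, *hyponyms]:
            token_to_category.setdefault(token.lower(), hypernym)
    return token_to_category


def embed(token):
    """Stand-in vectorizer (assumption): hashed character trigrams, L2-normalized."""
    vec = [0.0] * DIM
    padded = f"##{token.lower()}##"
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def nearest(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])))


def kmeans(vectors, k, iterations=10):
    """Plain k-means (assumption), initialized from the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_classifier(taxonomy, k):
    """Cluster the taxonomy embeddings, then label each cluster by majority category."""
    token_to_category = extract_tokens(taxonomy)
    tokens = list(token_to_category)
    centroids = kmeans([embed(t) for t in tokens], k)
    votes = [Counter() for _ in centroids]
    for t in tokens:
        votes[nearest(embed(t), centroids)][token_to_category[t]] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]

    def classify(table):
        """Classify a table (dict of column header -> cell values) by header votes."""
        tally = Counter()
        for header in table:
            label = cluster_label[nearest(embed(header), centroids)]
            if label is not None:
                tally[label] += 1
        return tally.most_common(1)[0][0] if tally else None

    return classify


def label_tables(tables, classify):
    """Group unlabeled tables into per-category corpora."""
    corpora = {}
    for name, table in tables.items():
        corpora.setdefault(classify(table), []).append(name)
    return corpora


classify = build_classifier(TAXONOMY, k=len(TAXONOMY))
tables = {  # hypothetical unlabeled tabular datasets
    "t1": {"invoice": [101, 102], "payment": [9.5, 20.0]},
    "t2": {"patient": ["a", "b"], "diagnosis": ["x", "y"]},
}
corpora = label_tables(tables, classify)
print(corpora)
```

In this sketch a table is classified only from its column headers, and the query step of the claim reduces to a dictionary lookup such as `corpora.get("finance", [])`. With so few tokens and such a crude embedding the clusters may mix categories, so the labels it prints should not be read as representative of a trained embedding model.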