US 12,277,144 B2
Systems, methods, and graphical user interfaces for taxonomy-based classification of unlabeled structured datasets
Nancy Anne Rausch, Apex, NC (US); Ruth Oluwadamilola Akintunde, Raleigh, NC (US); and Brant Nathan Kay, Pittsboro, NC (US)
Assigned to SAS INSTITUTE INC., Cary, NC (US)
Filed by SAS INSTITUTE INC., Cary, NC (US)
Filed on Jul. 13, 2023, as Appl. No. 18/221,684.
Claims priority of provisional application 63/398,827, filed on Aug. 17, 2022.
Claims priority of provisional application 63/391,772, filed on Jul. 24, 2022.
Prior Publication US 2024/0028621 A1, Jan. 25, 2024
Int. Cl. G06F 40/284 (2020.01); G06F 16/242 (2019.01); G06F 16/28 (2019.01)
CPC G06F 16/287 (2019.01) [G06F 16/2428 (2019.01); G06F 40/284 (2020.01)]
30 Claims
 
1. A non-transitory machine-readable storage medium storing computer instructions that, when executed by one or more processors, perform operations comprising:
identifying, from a database, a structured data corpus comprising a plurality of distinct, unlabeled structured datasets;
for each distinct, unlabeled structured dataset of the plurality of distinct, unlabeled structured datasets:
tokenizing, via one or more tokenization algorithms, a target distinct, unlabeled structured dataset into a plurality of distinct feature tokens;
computing, by a token vectorization model, an embedding value for the target distinct, unlabeled structured dataset based on the plurality of distinct feature tokens;
computing, by a taxonomy classification model, a taxonomy category label for the target distinct, unlabeled structured dataset based on an input of the embedding value, wherein the taxonomy classification model was created using tokens extracted from a pre-constructed taxonomy comprising a plurality of distinct hierarchical categories, wherein each distinct hierarchical category of the plurality of distinct hierarchical categories includes a hypernym token and one or more hyponym tokens, and wherein the tokens extracted from the pre-constructed taxonomy include the hypernym token and the one or more hyponym tokens of each distinct hierarchical category that have been clustered into a plurality of distinct clusters using an unsupervised machine learning model to form the taxonomy classification model;
associating the taxonomy category label with the target distinct, unlabeled structured dataset; and
outputting, to the database, a plurality of distinct corpora of taxonomy-labeled structured datasets based on the taxonomy category label computed for each of the plurality of distinct, unlabeled structured datasets, wherein each distinct corpus of the plurality of distinct corpora of taxonomy-labeled structured datasets relates to a distinct taxonomy category label and includes structured datasets classified to the distinct taxonomy category label.
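
The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation. The claim does not name a tokenizer, a vectorization model, or a clustering algorithm, so the sketch assumes a regex tokenizer over column headers, a hand-written table of 2-D token vectors (TOKEN_VECTORS) standing in for a trained token vectorization model, and a small k-means routine standing in for the unsupervised machine learning model. The taxonomy, the dataset names, and every function name are hypothetical, and an in-memory dict stands in for the database.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pre-constructed taxonomy: hypernym token -> hyponym tokens.
TAXONOMY = {"person": ["name", "email"], "finance": ["price", "invoice"]}

# Assumption: toy 2-D vectors standing in for a trained token vectorization model.
TOKEN_VECTORS = {
    "person": (1.0, 0.0), "name": (0.9, 0.1), "email": (0.8, 0.2),
    "finance": (0.0, 1.0), "price": (0.1, 0.9), "invoice": (0.2, 0.8),
}

def tokenize(columns):
    """Split column headers into distinct lowercase feature tokens."""
    tokens = []
    for col in columns:
        for tok in re.findall(r"[a-z0-9]+", col.lower()):
            if tok not in tokens:
                tokens.append(tok)
    return tokens

def mean(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def embed(tokens):
    """Average the vectors of known tokens; None if no token is known."""
    known = [TOKEN_VECTORS[t] for t in tokens if t in TOKEN_VECTORS]
    return mean(known) if known else None

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda i: sqdist(point, centroids[i]))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sqdist(p, c) for c in centroids)))
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        updated = [mean(groups[i]) if groups[i] else centroids[i]
                   for i in range(k)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def build_classifier(taxonomy):
    """Cluster hypernym and hyponym tokens, then name each cluster by majority hypernym."""
    points, owners = [], []
    for hypernym, hyponyms in taxonomy.items():
        for tok in [hypernym, *hyponyms]:
            points.append(TOKEN_VECTORS[tok])
            owners.append(hypernym)
    centroids = kmeans(points, k=len(taxonomy))
    votes = defaultdict(Counter)
    for p, owner in zip(points, owners):
        votes[nearest(p, centroids)][owner] += 1
    labels = [votes[i].most_common(1)[0][0] if votes[i] else "unclassified"
              for i in range(len(centroids))]
    return centroids, labels

def classify_corpus(corpus, centroids, labels):
    """Label every dataset and group dataset names into one corpus per label."""
    corpora = defaultdict(list)
    for name, columns in corpus.items():
        vec = embed(tokenize(columns))
        label = labels[nearest(vec, centroids)] if vec is not None else "unclassified"
        corpora[label].append(name)
    return dict(corpora)

# Hypothetical unlabeled structured datasets, represented by their column headers.
CORPUS = {
    "crm_export": ["Name", "Email"],
    "billing": ["Invoice No", "Price"],
    "contacts": ["email", "name", "id"],
}

centroids, labels = build_classifier(TAXONOMY)
corpora = classify_corpus(CORPUS, centroids, labels)
print(corpora)
```

In this sketch the "taxonomy classification model" is just the list of cluster centroids plus one label per cluster, and a dataset receives the label of the centroid nearest to its embedding value. Tokens missing from the toy vector table (such as "id" or "no") are ignored, and a dataset with no known tokens falls into an "unclassified" group; both behaviours are choices made for the sketch and are not drawn from the claim.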