US 11,748,382 B2
Data classification
Yannick Saillet, Stuttgart (DE); Namit Kabra, Hyderabad (IN); Mike W. Grasselt, Leinfelden-Echterdingen (DE); and Krishna Kishore Bonagiri, Ambajipet (IN)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on May 18, 2020, as Appl. No. 16/876,660.
Claims priority of application No. 19188266 (EP), filed on Jul. 25, 2019.
Prior Publication US 2021/0026872 A1, Jan. 28, 2021
Int. Cl. G06F 16/28 (2019.01); G06F 16/2457 (2019.01); G06F 16/22 (2019.01); G06N 20/00 (2019.01); G06F 16/248 (2019.01); G06F 18/214 (2023.01); G06N 7/01 (2023.01)
CPC G06F 16/285 (2019.01) [G06F 16/221 (2019.01); G06F 16/248 (2019.01); G06F 16/24573 (2019.01); G06F 18/214 (2023.01); G06N 7/01 (2023.01); G06N 20/00 (2019.01)]
16 Claims
1. A computer-implemented method for automatically classifying data fields of a dataset, the method comprising:
providing training datasets, each training dataset comprising one or more training first data fields and previous user-selected data class assignments for the one or more training first data fields, each training first data field being assigned with a plurality of data class candidates; and
executing a machine learning algorithm on the training datasets for generating a machine learning model, wherein executing the machine learning algorithm comprises:
retrieving, from a computer readable storage medium, the previous user-selected data class assignments for the one or more training first data fields;
calculating, using the machine learning model, clusters of data classes being assigned as data class candidates for the same training data fields of the training datasets;
calculating, using the machine learning model, probabilities for each of the data class candidates assigned to the training first data fields that a respective data class candidate is a data class to which the respective training first data field is assigned, taking into account the clusters of data classes assigned as data class candidates to adjacent training data fields within a predefined range of interest around the respective training first data field; and
comparing the calculated probabilities to the retrieved user-selected data class assignments;
storing, in the computer readable storage medium, confidence values for a plurality of data classes, wherein the confidence values are determined by applying a classifier to data fields of a dataset, wherein the classifier is configured for determining the confidence values independently from one another for a plurality of data fields, wherein the confidence values identify a level of confidence that a respective data field belongs to a respective data class, and wherein the data class for which the confidence value exceeds a predefined threshold is identified as a data class candidate for the respective data field for which the respective confidence value is determined;
calculating, by a processor, and storing, in the computer readable storage medium, first data fields for which the plurality of data class candidates are identifiable;
calculating, by the machine learning model trained using previous user-selected data class assignments, and storing, in the computer readable storage medium, a probability for the data class candidates identified for the first data fields that the respective data class candidate is a data class to which a respective first data field is to be assigned;
storing, in the computer readable storage medium, classifications for the stored first data fields, wherein the classifications are determined, by the processor, using the stored probabilities for the data class candidates to select for the stored first data fields a data class from the data class candidates for the respective first data field to which the respective first data field is assigned; and
storing, in the computer readable storage medium, the dataset with metadata identifying for the classified data fields of the dataset the data classes to which the respective classified data fields are assigned;
importing the classified dataset into a target database in the computer readable storage medium;
organizing the target database using the target data model defining a class-based arrangement of data fields;
rearranging data fields of the dataset according to the target data model using metadata identifying data classes to which the data fields of the dataset are assigned;
adding the rearranged dataset to the target database in accordance with the target data model;
executing a search query on the classified dataset with the metadata identifying data classes using a data class identifier as a search parameter of the search query; and
providing a search result, for the search query, comprising one or more data values comprised by datasets from data fields assigned to the data classes identified by the data class identifiers.
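
The classification and search steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation. All names here (`candidates`, `classify`, `toy_model`, `search`), the threshold of 0.5, the window of one adjacent field, and the two example clusters are assumptions made for illustration; in particular, `toy_model` is a hand-written stand-in for the trained machine learning model, and assigning a field with exactly one candidate directly to that candidate is a simplification not stated in the claim.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (field name, confidence value per data class), as produced by a per-field classifier
Field = Tuple[str, Dict[str, float]]
Model = Callable[[List[str], List[List[str]]], Dict[str, float]]


def candidates(confidences: Dict[str, float], threshold: float) -> List[str]:
    """Data classes whose confidence value exceeds the predefined threshold."""
    return sorted(c for c, v in confidences.items() if v > threshold)


def toy_model(own: List[str], neighbours: List[List[str]]) -> Dict[str, float]:
    """Hypothetical stand-in for the trained model: a candidate scores higher
    when adjacent fields carry candidates from the same cluster."""
    clusters = [{"first_name", "last_name"}, {"city", "zip_code"}]  # assumed clusters
    scores = {}
    for cls in own:
        score = 1.0
        for cluster in clusters:
            if cls in cluster:
                score += sum(1 for n in neighbours for k in n if k in cluster)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


def classify(fields: Sequence[Field], model: Model,
             threshold: float = 0.5, window: int = 1) -> Dict[str, str]:
    """Select one data class per field, using neighbouring candidates as context."""
    cands = [candidates(conf, threshold) for _, conf in fields]
    result: Dict[str, str] = {}
    for i, (name, _) in enumerate(fields):
        if not cands[i]:
            continue  # no data class exceeded the threshold; field stays unclassified
        if len(cands[i]) == 1:
            result[name] = cands[i][0]  # simplification: a single candidate is taken as-is
            continue
        # Fields within the range of interest around field i, excluding i itself
        lo, hi = max(0, i - window), min(len(fields), i + window + 1)
        neighbours = [cands[j] for j in range(lo, hi) if j != i]
        probs = model(cands[i], neighbours)
        result[name] = max(probs, key=probs.get)
    return result


def search(rows: List[Dict[str, str]], classes: Dict[str, str],
           data_class: str) -> List[str]:
    """Return values from every field whose metadata names the given data class."""
    cols = [f for f, c in classes.items() if c == data_class]
    return [row[f] for row in rows for f in cols if f in row]


fields = [
    ("col_a", {"first_name": 0.9}),
    ("col_b", {"last_name": 0.7, "city": 0.8}),  # ambiguous: two candidates
    ("col_c", {"zip_code": 0.2}),                # below threshold
]
classes = classify(fields, toy_model)
rows = [{"col_a": "Ada", "col_b": "Lovelace", "col_c": "x"}]
hits = search(rows, classes, "last_name")
```

In this example `col_b` has the higher raw confidence for `city`, but the adjacent `first_name` candidate shifts the toy model's probability toward `last_name`, which is the kind of neighbour-aware disambiguation the claim describes. The resulting `classes` mapping plays the role of the stored metadata, and `search` uses a data class identifier as its search parameter.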