US 11,720,533 B2
Automated classification of data types for databases
Rajmohan Chandrahasan, Kanchipuram (IN); Ankush Gupta, Uttarakhand (IN); Venkata Nagaraju Pavuluri, New Rochelle, NY (US); Arvind Agarwal, New Delhi (IN); and Sameep Mehta, Bangalore (IN)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Nov. 29, 2021, as Appl. No. 17/536,860.
Prior Publication US 2023/0169050 A1, Jun. 1, 2023
Int. Cl. G06F 16/21 (2019.01); G06F 16/22 (2019.01); G06F 16/23 (2019.01); G06F 16/2458 (2019.01); G06F 18/21 (2023.01); G06N 3/045 (2023.01)
CPC G06F 16/213 (2019.01) [G06F 16/2282 (2019.01); G06F 16/2358 (2019.01); G06F 16/2462 (2019.01); G06F 18/2178 (2023.01); G06N 3/045 (2023.01)]
20 Claims
1. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by one or more processors to cause the one or more processors to:
receive a portion of identifying information for one or more components of a database;
generate one or more descriptions for the one or more components based at least in part on the portion of the identifying information for the one or more components;
input the one or more descriptions and create, read, update and delete operations data of the database to one or more machine learning models;
predict, using the one or more machine learning models, one or more data types associated with the one or more components, wherein the prediction is based at least in part on the one or more descriptions and the create, read, update and delete operations data;
wherein the predicting comprises:
extracting from the create, read, update and delete operations data counts of a number of one or more of data reads, data writes, data deletes and data updates over a given time period for the one or more components; and
determining, based at least in part on the counts, the one or more data types associated with the one or more components; and
wherein the program instructions further cause the one or more processors to train the one or more machine learning models with: (i) labeled training data comprising respective ones of a plurality of data types corresponding to respective ones of a plurality of database components and respective ones of a plurality of descriptions of the database components; and (ii) data comprising correspondence between the respective ones of the plurality of data types and frequency of create, read, update and delete operations.
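The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not specify a model architecture, feature encoding, or description generator, so everything below is an assumption made for illustration. The names `OpEvent`, `crud_counts`, `describe`, `ABBREVIATIONS`, and `NearestNeighborTypePredictor` are hypothetical, a 1-nearest-neighbor lookup stands in for the "one or more machine learning models", and the labels `transactional` and `reference` are example data types that do not come from the claim text.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

OPS = ("create", "read", "update", "delete")

# Hypothetical abbreviation table used to expand identifier fragments.
ABBREVIATIONS = {"cust": "customer", "txn": "transaction", "addr": "address"}


@dataclass(frozen=True)
class OpEvent:
    """One logged operation against a database component (assumed log shape)."""
    component: str
    op: str
    at: datetime


def describe(identifier):
    """Generate a description from a portion of identifying information."""
    parts = identifier.lower().replace("-", "_").split("_")
    return " ".join(ABBREVIATIONS.get(p, p) for p in parts if p)


def crud_counts(events, component, start, end):
    """Count each CRUD operation on `component` within [start, end)."""
    counts = Counter(
        e.op for e in events
        if e.component == component and start <= e.at < end
    )
    return {op: counts.get(op, 0) for op in OPS}


def features(description, counts):
    """Combine description words with relative CRUD frequencies."""
    total = sum(counts.values()) or 1
    vec = {f"op:{op}": counts[op] / total for op in OPS}
    for word in description.split():
        vec[f"w:{word}"] = 1.0
    return vec


def distance(a, b):
    """Squared Euclidean distance between two sparse feature dicts."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


class NearestNeighborTypePredictor:
    """Stand-in model: returns the data type of the closest training example."""

    def __init__(self):
        self.examples = []

    def train(self, description, counts, data_type):
        self.examples.append((features(description, counts), data_type))

    def predict(self, description, counts):
        if not self.examples:
            raise ValueError("predictor has no training examples")
        query = features(description, counts)
        return min(self.examples, key=lambda ex: distance(ex[0], query))[1]


if __name__ == "__main__":
    model = NearestNeighborTypePredictor()
    model.train("order transaction",
                {"create": 90, "read": 10, "update": 0, "delete": 0},
                "transactional")
    model.train("country code",
                {"create": 0, "read": 99, "update": 1, "delete": 0},
                "reference")
    log = [OpEvent("cust_txn", "create", datetime(2021, 11, 2)),
           OpEvent("cust_txn", "read", datetime(2021, 11, 3))]
    window = (datetime(2021, 11, 1), datetime(2021, 12, 1))
    print(model.predict(describe("cust_txn"),
                        crud_counts(log, "cust_txn", *window)))
```

In this sketch the training examples carry both kinds of data named in the claim: a labeled description per component, and a CRUD frequency profile associated with each data type. A real system would presumably use a learned text encoder and a trained classifier rather than word overlap and nearest-neighbor distance.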