US 12,436,924 B2
Unsupervised learning from public tabular datasets
Thanh Lam Hoang, Maynooth (IE); Gabriele Picco, Dublin (IE); Lam Minh Nguyen, Ossining, NY (US); and Dzung Tien Phan, Pleasantville, NY (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Dec. 15, 2022, as Appl. No. 18/066,327.
Prior Publication US 2024/0202167 A1, Jun. 20, 2024
Int. Cl. G06F 16/21 (2019.01); G06F 16/22 (2019.01); G06N 20/00 (2019.01)
CPC G06F 16/211 (2019.01) [G06F 16/2282 (2019.01); G06N 20/00 (2019.01)]
20 Claims
 
1. A computer-implemented method for transferable feature engineering and synthetic data generation, the computer-implemented method comprising:
retrieving a plurality of data tables, wherein the plurality of data tables are heterogeneous in format and content;
removing at least one timestamp from the plurality of data tables to reduce noise within the plurality of data tables;
generating a variational auto-encoder (VAE) model;
training the VAE model on the plurality of data tables after removal of the at least one timestamp;
receiving an input data table;
generating a synthetic data table based on the input data table and the trained VAE model;
determining a subset of the plurality of data tables which have a lower column width than a maximum column width of the plurality of data tables;
inserting blank columns up to the maximum column width for the subset of the plurality of data tables; and
inserting a predetermined data value in the blank columns to prevent the VAE model from training the blank columns with the predetermined data.
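
The preprocessing steps recited in claim 1 (timestamp removal, then padding narrower tables to the maximum column width with a predetermined value) can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything named here is an assumption made for illustration: tables are plain dicts mapping column names to value lists, timestamp columns are detected by `datetime` type, the sentinel `PAD_VALUE` is -1.0, the padded column names use a hypothetical `__pad_` prefix, and the boolean mask is one possible way a training loss could skip padded columns. The VAE itself is omitted.

```python
from datetime import datetime

# Hypothetical sentinel; the claim only requires a "predetermined data value".
PAD_VALUE = -1.0


def drop_timestamp_columns(table):
    """Return a copy of `table` without columns made up entirely of datetime values."""
    return {
        name: values
        for name, values in table.items()
        if not (values and all(isinstance(v, datetime) for v in values))
    }


def pad_to_max_width(tables, pad_value=PAD_VALUE):
    """Pad narrower tables with sentinel-filled columns up to the widest table."""
    max_width = max(len(t) for t in tables)
    padded, masks = [], []
    for t in tables:
        n_rows = len(next(iter(t.values()))) if t else 0
        out = dict(t)
        for i in range(max_width - len(t)):
            out[f"__pad_{i}"] = [pad_value] * n_rows
        padded.append(out)
        # True marks a real column, False a padded column that a loss could ignore.
        masks.append([True] * len(t) + [False] * (max_width - len(t)))
    return padded, masks


# Two heterogeneous example tables (made-up values).
tables = [
    {
        "ts": [datetime(2022, 12, 15), datetime(2022, 12, 16)],
        "price": [1.5, 2.0],
        "qty": [3, 4],
        "sku": [10, 11],
    },
    {"ts": [datetime(2022, 1, 1)], "score": [0.7]},
]

cleaned = [drop_timestamp_columns(t) for t in tables]
padded, masks = pad_to_max_width(cleaned)
```

In this sketch the second table ends up with one real column and two sentinel-filled columns, so both tables share the same width and could be fed to a single fixed-input model.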