US 12,423,937 B2
Automated data pre-processing for machine learning
Yu Zui You, Ningbo (CN); Zhan Peng Huo, Beijing (CN); Jun Zhu, Shanghai (CN); Kuo-Liang Chou, New Taipei (TW); Xuan Feng, Beijing (CN); and Jun Hao, Dalian (CN)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Feb. 8, 2023, as Appl. No. 18/166,004.
Prior Publication US 2024/0265664 A1, Aug. 8, 2024
Int. Cl. G06V 10/20 (2022.01); G06V 10/774 (2022.01)

CPC G06V 10/20 (2022.01) [G06V 10/774 (2022.01)]

20 Claims

1. A method for performing automated pre-processing of data for a machine-learning based prediction system, the method comprising:

receiving a plurality of raw data sets;

receiving a plurality of processed data sets, wherein each of the plurality of processed data sets corresponds to one of the plurality of raw data sets;

generating a plurality of pre-processing templates based on the plurality of raw data sets and the processed data set, wherein generating the plurality of pre-processing templates comprises:

detecting and extracting attributes from the plurality of processed data sets;

comparing the plurality of processed data sets to the corresponding plurality of raw data sets; and

learning data processing methods used by data scientists;

receiving an input data set;

generating, for each of the plurality of pre-processing templates, a matching score for the input data set, wherein the matching score is calculated by executing a computer-implemented matching algorithm that, for each column in the input data set:

automatically extracting and comparing the column attribute and header to corresponding columns in the pre-processing template using a similarity metric;

applying user-assigned weights associated with each template column to the similarity results to generate a weighted column score; and

combining the weighted column scores for all columns in the input data set using a predetermined aggregation function to produce the matching score for the template;

selecting one of the plurality of pre-processing templates based on the matching score;

pre-processing the input data set using the selected pre-processing template; and

providing the pre-processed input data set to the machine learning based prediction system.
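
The matching and selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not name a similarity metric, the form of the column attribute, or the aggregation function, so the sketch assumes: header similarity from the standard-library difflib.SequenceMatcher, a column attribute reduced to a type name compared for equality, each input column paired with its most similar template column, and a weighted mean as the aggregation function. The names TemplateColumn, Template, column_similarity, matching_score, and select_template are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class TemplateColumn:
    header: str
    dtype: str           # column attribute; assumed here to be a type name
    weight: float = 1.0  # user-assigned weight for this template column


@dataclass
class Template:
    name: str
    columns: list[TemplateColumn]


def column_similarity(header: str, dtype: str, tcol: TemplateColumn) -> float:
    """Assumed metric: mean of header string similarity and a 0/1 type match."""
    header_sim = SequenceMatcher(None, header.lower(), tcol.header.lower()).ratio()
    dtype_sim = 1.0 if dtype == tcol.dtype else 0.0
    return 0.5 * header_sim + 0.5 * dtype_sim


def matching_score(input_columns: dict[str, str], template: Template) -> float:
    """Weighted mean of per-column similarities (assumed aggregation function)."""
    if not input_columns or not template.columns:
        return 0.0
    total = 0.0
    weight_sum = 0.0
    for header, dtype in input_columns.items():
        # Pair the input column with its most similar template column.
        best = max(template.columns,
                   key=lambda c: column_similarity(header, dtype, c))
        total += best.weight * column_similarity(header, dtype, best)
        weight_sum += best.weight
    return total / weight_sum if weight_sum else 0.0


def select_template(input_columns: dict[str, str],
                    templates: list[Template]) -> Template:
    """Pick the template with the highest matching score."""
    return max(templates, key=lambda t: matching_score(input_columns, t))


# Illustrative data only: two made-up templates and one input schema.
templates = [
    Template("sales", [TemplateColumn("order_id", "int", 1.0),
                       TemplateColumn("amount", "float", 2.0)]),
    Template("sensors", [TemplateColumn("timestamp", "datetime", 1.0),
                         TemplateColumn("temperature", "float", 2.0)]),
]
input_columns = {"Order_ID": "int", "amount": "float"}
best = select_template(input_columns, templates)
print(best.name)
```

In this sketch the input schema is a mapping from column header to type name. A fuller version would extract those attributes from the data itself and would then apply the transformations stored in the selected template before passing the result to the prediction system.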