| CPC G06V 10/20 (2022.01) [G06V 10/774 (2022.01)] | 20 Claims |

|
1. A method for performing automated pre-processing of data for a machine-learning based prediction system, the method comprising:
receiving a plurality of raw data sets;
receiving a plurality of processed data sets, wherein each of the plurality of processed data sets corresponds to one of the plurality of raw data sets;
generating a plurality of pre-processing templates based on the plurality of raw data sets and the plurality of processed data sets, wherein generating the plurality of pre-processing templates comprises:
detecting and extracting attributes from the plurality of processed data sets;
comparing the plurality of processed data sets to the corresponding plurality of raw data sets; and
learning data processing methods used by data scientists;
receiving an input data set;
generating, for each of the plurality of pre-processing templates, a matching score for the input data set, wherein the matching score is calculated by executing a computer-implemented matching algorithm comprising, for each column in the input data set:
automatically extracting and comparing the column attribute and header to corresponding columns in the pre-processing template using a similarity metric;
applying user-assigned weights associated with each template column to the similarity results to generate a weighted column score; and
combining the weighted column scores for all columns in the input data set using a predetermined aggregation function to produce the matching score for the template;
selecting one of the plurality of pre-processing templates based on the matching score;
pre-processing the input data set using the selected pre-processing template; and
providing the pre-processed input data set to the machine learning based prediction system.
|
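The matching-score and template-selection steps recited in claim 1 can be illustrated with a minimal Python sketch. The claim does not name a particular similarity metric, aggregation function, or rule for pairing input columns with template columns, so everything concrete here is an assumption made for illustration: the `TemplateColumn` type, the use of `difflib.SequenceMatcher` on headers combined with an exact data-type check as the similarity metric, the best-match pairing of each input column to a template column, and the arithmetic mean as the predetermined aggregation function.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class TemplateColumn:
    """Hypothetical description of one column in a pre-processing template."""
    header: str
    dtype: str     # column attribute, e.g. "int", "float", "str" (assumed)
    weight: float  # user-assigned weight for this template column


def column_similarity(header: str, dtype: str, col: TemplateColumn) -> float:
    """Assumed similarity metric in [0, 1]: the mean of header string
    similarity and exact data-type agreement."""
    header_sim = SequenceMatcher(None, header.lower(), col.header.lower()).ratio()
    dtype_sim = 1.0 if dtype == col.dtype else 0.0
    return 0.5 * header_sim + 0.5 * dtype_sim


def matching_score(input_columns, template) -> float:
    """Score one template against the input data set.

    input_columns: list of (header, dtype) pairs extracted from the input.
    template: list of TemplateColumn.
    """
    weighted_scores = []
    for header, dtype in input_columns:
        # Assumption: the "corresponding" template column is the one that
        # yields the highest weighted similarity for this input column.
        best = max(
            (col.weight * column_similarity(header, dtype, col) for col in template),
            default=0.0,
        )
        weighted_scores.append(best)
    # Assumed aggregation function: arithmetic mean over all input columns.
    return sum(weighted_scores) / len(weighted_scores) if weighted_scores else 0.0


def select_template(input_columns, templates: dict) -> str:
    """Return the name of the template with the highest matching score."""
    return max(templates, key=lambda name: matching_score(input_columns, templates[name]))


templates = {
    "people": [TemplateColumn("age", "int", 1.0), TemplateColumn("name", "str", 1.0)],
    "sales": [TemplateColumn("price", "float", 1.0), TemplateColumn("sku", "str", 1.0)],
}
input_columns = [("age", "int"), ("name", "str")]
selected = select_template(input_columns, templates)
```

In this sketch `selected` is `"people"`, because both input columns match that template's headers and data types exactly. A different similarity metric or aggregation function (for example a weighted sum or a minimum) would fit the same claim language equally well.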