US 12,190,218 B2
System and method of operationalizing automated feature engineering
James Max Kanter, Boston, MA (US); and Kalyan Kumar Veeramachaneni, Watertown, MA (US)
Assigned to Alteryx, Inc.
Filed by Alteryx, Inc., Irvine, CA (US)
Filed on Feb. 21, 2024, as Appl. No. 18/583,205.
Application 18/583,205 is a continuation of application No. 17/039,428, filed on Sep. 30, 2020, granted, now Pat. No. 11,941,497.
Prior Publication US 2024/0193485 A1, Jun. 13, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06N 20/00 (2019.01); G06F 16/2457 (2019.01); G06F 16/28 (2019.01)
CPC G06N 20/00 (2019.01) [G06F 16/24578 (2019.01); G06F 16/285 (2019.01)]
20 Claims
 
1. A computer implemented method, comprising:
receiving a dataset from a data source;
selecting a subset of primitives from a plurality of primitives based on the received dataset, each of the selected primitives comprising a computation configured to be applied to at least a portion of the received dataset to synthesize one or more features, wherein the selecting includes:
generating a representative vector for the received dataset;
obtaining representative vectors for the plurality of primitives; and
selecting the subset of primitives by comparing the representative vector for the received dataset to the representative vectors for the plurality of primitives;
synthesizing a plurality of features by applying the selected subset of primitives to the received dataset, wherein synthesizing the plurality of features includes, for each primitive in the subset:
identifying one or more variables in the received dataset; and
applying the primitive to the one or more variables to generate one or more features of the plurality of features;
iteratively evaluating the plurality of features to remove one or more features from the plurality of features to obtain a subset of features, wherein iteratively evaluating the plurality of features includes:
applying the plurality of features to a first portion of the received dataset to determine a first usefulness score of each of the plurality of features;
removing one or more of the plurality of features based on the first usefulness score of each of the plurality of features to obtain a preliminary subset of features;
applying the preliminary subset of features to a second portion of the received dataset to determine a second usefulness score of each of the preliminary subset of features; and
removing one or more of the preliminary subset of features from the preliminary subset of features based on the second usefulness score of each of the preliminary subset of features; and
generating a machine learning model based on the subset of features, the machine learning model configured to be used to make a prediction based on new data.
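
Illustrative sketch (not part of the claim). The steps recited in claim 1 can be walked through in a few lines of Python. Everything below is an assumption made for illustration only: the three toy primitives and their representative vectors, the two-element dataset vector (share of numeric columns, share of other columns), cosine similarity as the comparison, absolute Pearson correlation as the usefulness score, the even/odd row split as the first and second portions, and the averaged one-variable linear fits standing in for the machine learning model. The claim itself does not specify any of these choices.

```python
import math

# Hypothetical primitives: name -> (representative vector, computation).
PRIMITIVES = {
    "square": ([1.0, 0.0], lambda col: [v * v for v in col]),
    "log1p_abs": ([0.9, 0.1], lambda col: [math.log1p(abs(v)) for v in col]),
    "str_length": ([0.0, 1.0], lambda col: [float(len(str(v))) for v in col]),
}

def is_numeric(col):
    return all(isinstance(v, (int, float)) for v in col)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dataset_vector(columns):
    # Assumed representation: [share of numeric columns, share of other columns].
    share = sum(is_numeric(c) for c in columns.values()) / len(columns)
    return [share, 1.0 - share]

def select_primitives(columns, k):
    # Compare the dataset vector with each primitive vector; keep the k closest.
    dv = dataset_vector(columns)
    ranked = sorted(PRIMITIVES, key=lambda n: cosine(dv, PRIMITIVES[n][0]), reverse=True)
    return ranked[:k]

def synthesize(columns, names):
    # For each primitive, identify the numeric variables and apply the primitive.
    return {
        (name, var): PRIMITIVES[name][1](col)
        for name in names
        for var, col in columns.items()
        if is_numeric(col)
    }

def usefulness(values, target):
    # Assumed usefulness score: absolute Pearson correlation with the target.
    n = len(values)
    mv, mt = sum(values) / n, sum(target) / n
    cov = sum((v - mv) * (t - mt) for v, t in zip(values, target))
    spread = math.sqrt(sum((v - mv) ** 2 for v in values) * sum((t - mt) ** 2 for t in target))
    return abs(cov / spread) if spread else 0.0

def prune(features, target, rows, keep):
    # Score every feature on the given rows and keep the `keep` best.
    y = [target[i] for i in rows]
    scores = {key: usefulness([vals[i] for i in rows], y) for key, vals in features.items()}
    best = sorted(scores, key=scores.get, reverse=True)[:keep]
    return {key: features[key] for key in best}

def fit_model(features, target):
    # Stand-in model: average of one-variable least-squares fits, one per feature.
    mt = sum(target) / len(target)
    fits = {}
    for key, vals in features.items():
        mv = sum(vals) / len(vals)
        sxx = sum((v - mv) ** 2 for v in vals)
        sxy = sum((v - mv) * (t - mt) for v, t in zip(vals, target))
        slope = sxy / sxx if sxx else 0.0
        fits[key] = (slope, mt - slope * mv)

    def predict(new_columns):
        per_feature = [
            [slope * v + intercept for v in PRIMITIVES[name][1](new_columns[var])]
            for (name, var), (slope, intercept) in fits.items()
        ]
        return [sum(p) / len(p) for p in zip(*per_feature)]

    return predict

# Toy dataset (invented): the target is exactly the square of x1.
columns = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "x2": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8],
}
target = [v * v for v in columns["x1"]]

chosen = select_primitives(columns, k=2)
features = synthesize(columns, chosen)
preliminary = prune(features, target, rows=range(0, 12, 2), keep=3)  # first portion
selected = prune(preliminary, target, rows=range(1, 12, 2), keep=2)  # second portion
predict = fit_model(selected, target)
prediction = predict({"x1": [5], "x2": [2]})
```

In this toy run the text-oriented primitive is ranked last and dropped, four candidate features are synthesized from the two remaining primitives, and two rounds of scoring on disjoint row subsets narrow them to two. Scoring the survivors again on held-back rows is one way to avoid keeping a feature that only looked useful on the first portion.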