| CPC G06F 16/215 (2019.01) [G06F 16/258 (2019.01); G06N 3/045 (2023.01)] | 18 Claims |

|
1. A computer-implemented method comprising:
receiving, by one or more processors, (i) inbound data, (ii) a companion file specifying a data format of the inbound data, and (iii) a sample outbound table, wherein the sample outbound table comprises a representation of a data collection scheme currently employed by a system, and the data collection scheme is the scheme into which the inbound data will be ingested;
generating, by the one or more processors, an inbound table by extracting data from the inbound data based at least in part on the companion file;
generating, by the one or more processors and based at least in part on one or more encodings obtained from a column name encoding machine learning model and a column value encoding machine learning model, a cross-column similarity measure for a column pair, wherein the column pair comprises a first column from the inbound table and a second column from the sample outbound table, and wherein generating the cross-column similarity measure comprises:
identifying (i) a first plurality of column name characters of a first column name of the first column and (ii) a second plurality of column name characters of a second column name of the second column;
determining, using a column name character embedding sub-model of the column name encoding machine learning model, (i) a first plurality of column name character embeddings based at least in part on the first plurality of column name characters and (ii) a second plurality of column name character embeddings based at least in part on the second plurality of column name characters;
determining, using a sequential column name processing sub-model of the column name encoding machine learning model, (i) a first name-based encoding for the first column based at least in part on the first plurality of column name character embeddings and (ii) a second name-based encoding for the second column based at least in part on the second plurality of column name character embeddings;
identifying (i) a first plurality of sampled column values for the first column and (ii) a second plurality of sampled column values for the second column;
determining, using a sequential column value processing sub-model of the column value encoding machine learning model, (i) a first plurality of column value encodings for the first plurality of sampled column values and (ii) a second plurality of column value encodings for the second plurality of sampled column values;
determining (i) a first value-based encoding of the first column based at least in part on the first plurality of column value encodings and (ii) a second value-based encoding of the second column based at least in part on the second plurality of column value encodings;
determining a name-based cross-column similarity measure for the column pair based at least in part on the first name-based encoding and the second name-based encoding;
determining a value-based cross-column similarity measure for the column pair based at least in part on the first value-based encoding and the second value-based encoding; and
determining the cross-column similarity measure for the column pair based at least in part on the name-based cross-column similarity measure and the value-based cross-column similarity measure;
determining, by the one or more processors, an anomaly measure for a first column value encoding of the first plurality of column value encodings with respect to a second column value encoding of the second plurality of column value encodings;
determining, by the one or more processors, that the first plurality of sampled column values comprises one or more anomalous column values based at least in part on the anomaly measure;
automatically generating, by the one or more processors, a mapping from the inbound table to the sample outbound table while simultaneously purging the inbound table of the one or more anomalous column values, wherein the mapping is generated by selecting the column pair from a plurality of column pairs based at least in part on the cross-column similarity measure such that a column of the inbound table is assigned to a selected column of the sample outbound table, wherein the selected column of the sample outbound table has a highest cross-column similarity measure with respect to the column of the inbound table; and
automatically ingesting, by the one or more processors, the inbound data into the data collection scheme according to the generated mapping and without the one or more anomalous column values.
|
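
The scoring, anomaly, and mapping steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claim does not fix the similarity function, the way per-value encodings are pooled, how the two similarity measures are combined, or how the anomaly measure is thresholded. The sketch assumes cosine similarity, mean pooling, an equal-weight average (`name_weight`), and a cutoff (`threshold`), and it replaces the outputs of the column name and column value encoding models with hand-written toy vectors. All function names, parameters, and column names are hypothetical.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(value_encodings):
    """Value-based column encoding. ASSUMPTION: mean of the per-value encodings."""
    return [mean(dim) for dim in zip(*value_encodings)]


def cross_column_similarity(name_a, name_b, values_a, values_b, name_weight=0.5):
    """Combine name-based and value-based similarity.
    ASSUMPTION: a weighted average; the claim only says 'based at least in part on'."""
    name_sim = cosine(name_a, name_b)
    value_sim = cosine(mean_pool(values_a), mean_pool(values_b))
    return name_weight * name_sim + (1.0 - name_weight) * value_sim


def anomalous_indices(values_a, values_b, threshold=0.5):
    """Indices of inbound value encodings flagged as anomalous.
    ASSUMPTION: anomaly measure = 1 - best cosine similarity to any outbound
    value encoding, flagged when above a hypothetical threshold.
    values_b must be non-empty."""
    flagged = []
    for i, enc in enumerate(values_a):
        measure = 1.0 - max(cosine(enc, ref) for ref in values_b)
        if measure > threshold:
            flagged.append(i)
    return flagged


def build_mapping(inbound, outbound):
    """Assign each inbound column to the outbound column with the highest
    cross-column similarity. Both arguments map a column name to a pair
    (name_encoding, list_of_value_encodings)."""
    mapping = {}
    for in_col, (in_name, in_vals) in inbound.items():
        mapping[in_col] = max(
            outbound,
            key=lambda out_col: cross_column_similarity(
                in_name, outbound[out_col][0], in_vals, outbound[out_col][1]
            ),
        )
    return mapping


# Toy encodings standing in for model outputs (hypothetical data).
inbound = {
    "dob": ([1.0, 0.0], [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]),
    "zip": ([0.0, 1.0], [[0.1, 0.9], [0.0, 1.0]]),
}
outbound = {
    "birth_date": ([0.9, 0.1], [[1.0, 0.0], [0.8, 0.2]]),
    "postal_code": ([0.1, 0.9], [[0.0, 1.0], [0.2, 0.8]]),
}

mapping = build_mapping(inbound, outbound)
# mapping == {"dob": "birth_date", "zip": "postal_code"}

flagged = anomalous_indices(inbound["dob"][1], outbound[mapping["dob"]][1])
# flagged == [2]: the third sampled "dob" value resembles no "birth_date" value
```

In this sketch the flagged indices would be dropped from the inbound column before ingestion, which mirrors the claim's purging of anomalous column values alongside generation of the mapping. A production system would obtain the encodings from the trained character-level and sequential sub-models rather than from fixed vectors.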