| CPC G06F 16/215 (2019.01) [G06F 16/285 (2019.01); G06F 40/284 (2020.01)] | 19 Claims |

|
1. A computer-implemented method comprising:
receiving a dataset comprising a plurality of tables;
generating embeddings for column titles of a first table of the plurality of tables;
based on the embeddings, forming a first cluster comprising the first table and a second table of the plurality of tables;
for each table in the first cluster, generating a vector for each column based on a frequency distribution of cell values for each column and forming a matrix from the vectors;
forming, within the first cluster, a second cluster based on a comparison of at least a portion of the matrices, wherein the second cluster includes the first table and the second table;
based on the second cluster including the first table and the second table, calculating a similarity score for the first table and the second table; and
based on the similarity score exceeding a threshold, deleting the second table.
|
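
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: every function name, parameter, and threshold value below is hypothetical. The character-trigram hashing stands in for whatever embedding model is actually used, the mean absolute difference stands in for the matrix comparison, and the row-level Jaccard index stands in for the similarity score, none of which the claim specifies. For simplicity the sketch embeds the column titles of every table and assumes equal-length columns.

```python
from collections import Counter
from math import sqrt

def embed_title(title, dim=64):
    """Toy stand-in for an embedding: hash character trigrams into `dim` buckets."""
    vec = [0.0] * dim
    text = f"  {title.lower()}  "
    for i in range(len(text) - 2):
        bucket = sum(ord(c) * 31 ** j for j, c in enumerate(text[i:i + 3])) % dim
        vec[bucket] += 1.0
    return vec

def embed_table(table, dim=64):
    """Average the column-title embeddings of one table."""
    vecs = [embed_title(title, dim) for title in table]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def column_vector(values, k=4):
    """Top-k relative frequencies of the cell values, descending, zero-padded."""
    counts = Counter(values)
    total = sum(counts.values())
    freqs = sorted((c / total for c in counts.values()), reverse=True)[:k]
    return freqs + [0.0] * (k - len(freqs))

def table_matrix(table, k=4):
    """One frequency vector per column, stacked in column-title order."""
    return [column_vector(table[title], k) for title in sorted(table)]

def matrix_distance(m1, m2):
    """Mean absolute difference; matrices of different shape never match."""
    if len(m1) != len(m2):
        return float("inf")
    diffs = [abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2)]
    return sum(diffs) / len(diffs) if diffs else 0.0

def row_jaccard(t1, t2):
    """Assumed similarity score: Jaccard index over the sets of rows."""
    def rows(t):
        return set(zip(*(t[c] for c in sorted(t))))
    a, b = rows(t1), rows(t2)
    return len(a & b) / len(a | b) if a | b else 0.0

def deduplicate(tables, first, embed_sim=0.8, matrix_tol=0.1, score_threshold=0.9):
    """Delete tables that duplicate `first`; returns the names of deleted tables."""
    anchor = embed_table(tables[first])
    # First cluster: tables whose column titles embed close to those of `first`.
    cluster1 = [n for n in tables if cosine(anchor, embed_table(tables[n])) >= embed_sim]
    # Second cluster: members whose frequency matrices are close to that of `first`.
    matrices = {n: table_matrix(tables[n]) for n in cluster1}
    cluster2 = [n for n in cluster1
                if matrix_distance(matrices[first], matrices[n]) <= matrix_tol]
    deleted = []
    for name in cluster2:
        if name != first and row_jaccard(tables[first], tables[name]) > score_threshold:
            del tables[name]  # similarity score exceeds the threshold
            deleted.append(name)
    return deleted

tables = {
    "orders": {"id": [1, 2, 3, 4], "city": ["Oslo", "Oslo", "Rome", "Rome"]},
    "orders_copy": {"id": [1, 2, 3, 4], "city": ["Oslo", "Oslo", "Rome", "Rome"]},
    "staff": {"name": ["Ann", "Bob", "Cy", "Di"], "salary": [10, 20, 30, 40]},
}
deleted = deduplicate(tables, "orders")
```

The two clustering passes act as progressively more expensive filters: title embeddings narrow the candidates cheaply, the frequency matrices narrow them further using cell contents, and the pairwise similarity score is computed only for tables that survive both.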