US 12,436,926 B2
Relating data in data lakes
Raunak Shah, Wagholi (IN); Koyel Mukherjee, Bangalore (IN); Subrata Mitra, Bangalore (IN); Dhruv Joshi, Uttarakhand (IN); Sai Keerthana Karnam, Andhra Pradesh (IN); and Shivam Pravin Bhosale, Maharashtra (IN)
Assigned to ADOBE INC., San Jose, CA (US)
Filed by Adobe Inc., San Jose, CA (US)
Filed on May 18, 2023, as Appl. No. 18/319,748.
Prior Publication US 2024/0386002 A1, Nov. 21, 2024
Int. Cl. G06F 16/00 (2019.01); G06F 16/215 (2019.01); G06F 16/28 (2019.01); G06F 40/284 (2020.01)
CPC G06F 16/215 (2019.01) [G06F 16/285 (2019.01); G06F 40/284 (2020.01)]
19 Claims
 
1. A computer-implemented method comprising:
receiving a dataset comprising a plurality of tables;
generating embeddings for column titles of a first table of the plurality of tables;
based on the embeddings, forming a first cluster comprising the first table and a second table of the plurality of tables;
for each table in the first cluster, generating a vector for each column based on a frequency distribution of cell values for each column and forming a matrix from the vectors;
forming, within the first cluster, a second cluster based on a comparison of at least a portion of the matrices, wherein the second cluster includes the first table and the second table;
based on the second cluster including the first table and the second table, calculating a similarity score for the first table and the second table; and
based on the similarity score exceeding a threshold, deleting the second table.
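
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and every concrete choice in it is an assumption made for illustration: the hashed character-trigram "embedding" of column titles, the top-k relative-frequency vector per column, the pairwise threshold checks that stand in for forming the first and second clusters, the Jaccard overlap of rows used as the similarity score, and all function names and threshold values.

```python
import math
import zlib
from collections import Counter


def title_embedding(titles, dim=64):
    """Stand-in embedding (assumption): hashed character-trigram counts."""
    vec = [0.0] * dim
    for title in titles:
        text = f"  {title.lower()}  "
        for i in range(len(text) - 2):
            vec[zlib.crc32(text[i:i + 3].encode()) % dim] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def frequency_vector(cells, k=4):
    """Relative frequencies of the k most common cell values, zero-padded."""
    total = len(cells) or 1
    top = sorted(Counter(cells).values(), reverse=True)[:k]
    return [count / total for count in top] + [0.0] * (k - len(top))


def table_matrix(table):
    """One frequency vector per column, stacked into a matrix."""
    return [frequency_vector(cells) for cells in table.values()]


def matrix_similarity(m1, m2):
    """Average, over rows of m1, of the best cosine match among rows of m2."""
    if not m1 or not m2:
        return 0.0
    return sum(max(cosine(r, s) for s in m2) for r in m1) / len(m1)


def row_jaccard(t1, t2):
    """Similarity score (assumption): Jaccard overlap of the tables' rows."""
    rows1, rows2 = set(zip(*t1.values())), set(zip(*t2.values()))
    union = rows1 | rows2
    return len(rows1 & rows2) / len(union) if union else 0.0


def deduplicate(tables, title_thr=0.8, matrix_thr=0.9, score_thr=0.9):
    """Return the tables that survive; thresholds are illustrative only."""
    names, kept = list(tables), dict(tables)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first not in kept or second not in kept:
                continue
            a, b = tables[first], tables[second]
            # Stage 1: column-title embeddings gate the first cluster.
            if cosine(title_embedding(list(a)), title_embedding(list(b))) < title_thr:
                continue
            # Stage 2: frequency-distribution matrices gate the second cluster.
            if matrix_similarity(table_matrix(a), table_matrix(b)) < matrix_thr:
                continue
            # Stage 3: delete the second table if the score exceeds the threshold.
            if row_jaccard(a, b) > score_thr:
                del kept[second]
    return kept


tables = {
    "orders": {"order_id": [1, 2, 3, 4], "country": ["IN", "IN", "US", "US"]},
    "orders_copy": {"order_id": [1, 2, 3, 4], "country": ["IN", "IN", "US", "US"]},
    "users": {"user_name": ["a", "b", "c", "d"], "age": [30, 30, 41, 52]},
}
kept = deduplicate(tables)
print(sorted(kept))  # ['orders', 'users']
```

In this sketch the cheap title comparison runs first and the row-level score runs last, mirroring the order of the claimed steps, so the most expensive comparison is applied only to tables that pass both earlier stages.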