CPC G06F 16/221 (2019.01) [G06F 16/243 (2019.01)] | 20 Claims |
1. A non-transitory computer readable storage medium storing instructions that are executable by one or more processors to cause the one or more processors to perform operations for dataset discovery, the operations comprising:
accessing a data repository comprising a plurality of tables having cell values arranged in one or more columns and one or more rows;
generating serialized sequences of the cell values that correspond to particular columns of the plurality of tables;
inputting the serialized sequences into a natural language model;
converting, using the natural language model, the serialized sequences into contextualized embeddings associated with the plurality of tables;
storing the contextualized embeddings associated with the plurality of tables in one or more vector indices;
receiving a query table; and
generating an output of one or more candidate tables from the plurality of tables having a unionability with the received query table, by:
determining, using the one or more vector indices, one or more column unionability scores between columns of the query table and the one or more columns of the plurality of tables;
determining, using the one or more column unionability scores, one or more table unionability scores between the query table and one or more of the plurality of tables; and
outputting the one or more candidate tables based on the one or more table unionability scores.
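The operations recited in claim 1 can be illustrated with a minimal sketch. It is not the claimed implementation, and several pieces are assumptions made only for illustration: a sparse bag-of-tokens vector stands in for the natural language model's contextualized embeddings, a plain Python list stands in for the vector index, cosine similarity serves as the column unionability score, and the table unionability score is taken as the mean of each query column's best column score. All function and variable names (`serialize_column`, `embed`, `build_index`, `table_scores`, `top_k`, `REPO`, `QUERY`) are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Table = Dict[str, List[str]]          # column name -> cell values
Embedding = Dict[str, float]          # sparse vector: token -> weight
IndexEntry = Tuple[str, str, Embedding]  # (table name, column name, embedding)


def serialize_column(values: List[str]) -> str:
    """Serialize one column's cell values into a single sequence."""
    return " [SEP] ".join(str(v).lower() for v in values)


def embed(sequence: str) -> Embedding:
    """ASSUMPTION: toy stand-in for a language model.
    Returns an L2-normalized bag-of-tokens vector, not a contextualized one."""
    counts = Counter(tok for tok in sequence.split() if tok != "[SEP]")
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def build_index(repo: Dict[str, Table]) -> List[IndexEntry]:
    """ASSUMPTION: a flat list plays the role of the vector index."""
    return [
        (tname, cname, embed(serialize_column(values)))
        for tname, table in repo.items()
        for cname, values in table.items()
    ]


def cosine(a: Embedding, b: Embedding) -> float:
    """Column unionability score (vectors are already unit length)."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def table_scores(query: Table, index: List[IndexEntry]) -> Dict[str, float]:
    """Aggregate column scores into one table unionability score per table.
    ASSUMPTION: mean over query columns of the best-matching column score."""
    q_embs = [embed(serialize_column(v)) for v in query.values()]
    if not q_embs:
        return {}
    best: Dict[str, List[float]] = {}
    for tname, _cname, emb in index:
        scores = best.setdefault(tname, [0.0] * len(q_embs))
        for i, q in enumerate(q_embs):
            scores[i] = max(scores[i], cosine(q, emb))
    return {t: sum(s) / len(s) for t, s in best.items()}


def top_k(query: Table, index: List[IndexEntry], k: int = 1) -> List[str]:
    """Output the k candidate tables with the highest table scores."""
    scores = table_scores(query, index)
    return sorted(scores, key=lambda t: scores[t], reverse=True)[:k]


# Hypothetical data for demonstration only.
REPO: Dict[str, Table] = {
    "cities": {"city": ["Paris", "London", "Berlin"],
               "country": ["France", "UK", "Germany"]},
    "fruits": {"fruit": ["Apple", "Banana", "Cherry"],
               "color": ["Red", "Yellow", "Red"]},
}
QUERY: Table = {"town": ["Paris", "Berlin", "Madrid"],
                "nation": ["France", "Germany", "Spain"]}

INDEX = build_index(REPO)
print(top_k(QUERY, INDEX, k=1))
```

In a realistic setting the `embed` function would call a pretrained language model and the list-based index would be replaced by an approximate nearest-neighbor index; the control flow (serialize, embed, index, score columns, aggregate to tables, rank) would stay the same.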
|