US 12,189,590 B1
Systems and methods for querying large data repositories
Grace Fan, Wayne, PA (US); Jin Wang, Cupertino, CA (US); Yuliang Li, Cupertino, CA (US); Dan Zhang, Sunnyvale, CA (US); and Renée J. Miller, Boston, MA (US)
Assigned to RECRUIT CO., LTD., Tokyo (JP)
Filed by Recruit Co., Ltd., Tokyo (JP)
Filed on Sep. 19, 2023, as Appl. No. 18/470,350.
Int. Cl. G06F 17/00 (2019.01); G06F 16/22 (2019.01); G06F 16/242 (2019.01)
CPC G06F 16/221 (2019.01) [G06F 16/243 (2019.01)]
20 Claims
 
1. A non-transitory computer readable storage medium storing instructions that are executable by one or more processors to cause the one or more processors to perform operations for dataset discovery, the operations comprising:
accessing a data repository comprising a plurality of tables having cell values arranged in one or more columns and one or more rows;
generating serialized sequences of the cell values that correspond to particular columns of the plurality of tables;
inputting the serialized sequences into a natural language model;
converting, using the natural language model, the serialized sequences into contextualized embeddings associated with the plurality of tables;
storing the contextualized embeddings associated with the plurality of tables in one or more vector indices;
receiving a query table; and
generating an output of one or more candidate tables from the plurality of tables having a unionability with the received query table, by:
determining, using the one or more vector indices, one or more column unionability scores between columns of the query table and the one or more columns of the plurality of tables;
determining, using the one or more column unionability scores, one or more table unionability scores between the query table and one or more of the plurality of tables; and
outputting the one or more candidate tables based on the one or more table unionability scores.
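
The operations recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (`serialize_column`, `embed`, `build_index`, `table_score`, `search`) is hypothetical, and several parts are assumptions the claim does not specify: a sparse bag-of-tokens vector stands in for the contextualized embeddings produced by the natural language model, a plain dictionary scanned by brute force stands in for the vector indices, cosine similarity serves as the column unionability score, and a greedy one-to-one column matching averaged over the query columns serves as the table unionability score.

```python
import math
from collections import Counter


def serialize_column(header, cells):
    """Flatten one column (header plus cell values) into a single sequence."""
    return " ".join([header] + [str(c) for c in cells])


def embed(sequence):
    """Toy stand-in for a language model (assumption): an L2-normalized
    sparse bag-of-tokens vector, stored as token -> weight."""
    counts = Counter(sequence.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(u, v):
    """Column unionability score (assumption): cosine of unit-length vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def build_index(repository):
    """Brute-force 'vector index': table name -> list of column embeddings."""
    return {
        name: [embed(serialize_column(h, cells)) for h, cells in table.items()]
        for name, table in repository.items()
    }


def table_score(query_vecs, cand_vecs):
    """Table unionability score (assumption): greedy one-to-one column
    matching, averaged over the number of query columns."""
    pairs = sorted(
        ((cosine(q, c), i, j)
         for i, q in enumerate(query_vecs)
         for j, c in enumerate(cand_vecs)),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for score, i, j in pairs:
        if i not in used_q and j not in used_c:
            used_q.add(i)
            used_c.add(j)
            total += score
    return total / len(query_vecs)


def search(index, query_table, k=1):
    """Return the names of the k candidate tables with the highest table score."""
    q = [embed(serialize_column(h, cells)) for h, cells in query_table.items()]
    ranked = sorted(
        ((table_score(q, vecs), name) for name, vecs in index.items()),
        reverse=True,
    )
    return [name for _, name in ranked[:k]]


# Illustrative data only; each table is a dict of column header -> cell values.
repository = {
    "cities": {"city": ["Tokyo", "Boston", "Cupertino"],
               "country": ["Japan", "USA", "USA"]},
    "fruits": {"fruit": ["apple", "pear", "plum"],
               "color": ["red", "green", "purple"]},
}
index = build_index(repository)
query = {"city": ["Boston", "Sunnyvale"], "country": ["USA", "USA"]}
print(search(index, query, k=1))  # prints ['cities']
```

In this toy data the query shares tokens only with the `cities` table, so that table ranks first. A real system would presumably replace `embed` with a trained model and the dictionary scan with an approximate nearest-neighbor index; those choices are outside what this sketch shows.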