US 12,346,330 B2
Efficient merge of tabular data using mixing
Bart Samwel, Oegstgeest (NL); Tathagata Das, New Haven, CT (US); Lars Kroll, Almere (NL); Yijia Cui, Sunnyvale, CA (US); Juliusz Sompolski, Amsterdam (NL); and Tom Van Bussel, Amsterdam (NL)
Assigned to Databricks, Inc., San Francisco, CA (US)
Filed by Databricks, Inc., San Francisco, CA (US)
Filed on Aug. 25, 2022, as Appl. No. 17/895,882.
Prior Publication US 2024/0070155 A1, Feb. 29, 2024
Int. Cl. G06F 16/2455 (2019.01); G06F 16/22 (2019.01)
CPC G06F 16/2456 (2019.01) [G06F 16/2282 (2019.01)]
22 Claims
1. A system, comprising:
one or more processors configured to:
determine to merge a target table and a source table, wherein the target table is stored in a plurality of target table files, each target table file storing one or more records, wherein the source table has one or more records that are different from a target table;
in response to determining to merge the target table and the source table,
perform a first job that determines a set of matching target table files,
wherein performing the first job comprises:
identifying the set of matching target table files comprising a subset of the plurality of target table files, wherein each matching target table file of the set of matching target table files is identified responsive to determining that the matching target table file stores a row that matches a row from the source table; and
storing target table information indicating (i) the set of matching target table files, and (ii) for each matching target table file of the set of matching target table files, a particular set of rows having matching rows in one or more matching source table files;
perform a second job that processes the rows of the set of the matching target table files, wherein performing the second job comprises:
for each of the set of matching target table files,
performing a first matching action based at least in part on (i) a set of the particular set of rows comprised in the matching target table files, and (ii) matching rows in the one or more matching source table files; and
selecting a first set of unmatched rows from the matching target table files to process in the second job, the first set of unmatched rows being selected based at least in part on a mixing policy;
performing a second matching action based at least in part on the first set of unmatched rows;
obtaining one or more second job resulting files based on the first matching action and the second matching action;
obtain one or more other resulting files based at least in part on a second set of unmatched rows among the target table and the source table that results from the first set of unmatched rows having been processed in the second job; and
obtain a resulting table based at least in part on (i) the one or more second job resulting files, and (ii) the one or more other resulting files; and
a memory configured to store the resulting table.
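The two-job flow recited in claim 1 can be illustrated with a minimal in-memory sketch. It is not the patented implementation: tables are modeled as plain Python dictionaries mapping a file name to a list of rows, the "first matching action" is assumed to be an update from the source row, the "second matching action" is assumed to be an unchanged copy, and `mix_small_files` is a hypothetical mixing policy (mix in a file's unmatched rows only when there are few of them). All function names, the `job2-` and `other-0` file names, and the `max_unmatched` threshold are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, object]
Files = Dict[str, List[Row]]          # file name -> rows stored in that file
Policy = Callable[[List[Row], Set[int]], Set[int]]


def first_job(target: Files, source: List[Row], key: str) -> Dict[str, Set[int]]:
    """Job 1: for each matching target file, record the indices of matching rows."""
    source_keys = {row[key] for row in source}
    matches: Dict[str, Set[int]] = {}
    for name, rows in target.items():
        hit = {i for i, row in enumerate(rows) if row[key] in source_keys}
        if hit:                        # only files with at least one match are kept
            matches[name] = hit
    return matches


def mix_small_files(rows: List[Row], matched: Set[int], max_unmatched: int = 2) -> Set[int]:
    """Hypothetical mixing policy: select all unmatched rows if there are few of them."""
    unmatched = set(range(len(rows))) - matched
    return unmatched if len(unmatched) <= max_unmatched else set()


def merge(target: Files, source: List[Row], key: str,
          policy: Policy = mix_small_files) -> Files:
    matches = first_job(target, source, key)
    source_by_key = {row[key]: row for row in source}
    # Target files without any match are carried over untouched.
    result: Files = {n: rows for n, rows in target.items() if n not in matches}
    deferred: List[Row] = []           # unmatched target rows not mixed into job 2
    matched_keys = set()

    # Job 2: process the rows of each matching target file.
    for name, matched in matches.items():
        rows = target[name]
        mixed = policy(rows, matched)  # first set of unmatched rows
        out: List[Row] = []
        for i, row in enumerate(rows):
            if i in matched:
                out.append(dict(source_by_key[row[key]]))  # assumed action: update
                matched_keys.add(row[key])
            elif i in mixed:
                out.append(row)                            # assumed action: copy
            else:
                deferred.append(row)
        result[f"job2-{name}"] = out   # a second job resulting file

    # Second set of unmatched rows: deferred target rows plus unmatched source rows.
    leftover = deferred + [row for row in source if row[key] not in matched_keys]
    if leftover:
        result["other-0"] = leftover   # an "other" resulting file
    return result


target = {
    "f1": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    "f2": [{"id": 3, "v": "c"}],
    "f3": [{"id": 4, "v": "d"}, {"id": 5, "v": "e"},
           {"id": 6, "v": "f"}, {"id": 7, "v": "g"}],
}
source = [{"id": 1, "v": "A"}, {"id": 4, "v": "D"}, {"id": 9, "v": "Z"}]
result = merge(target, source, key="id")
```

In this example, `f1` has a single unmatched row, so the policy mixes it into the second job's output file; `f3` has three, so they are deferred and written together with the unmatched source row. Swapping in a different `policy` function changes only which unmatched rows are handled in the second job, not the final set of rows.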