US 12,346,331 B1
Batch materialization using bloom filters
Saikiran Sri Thunuguntla, Bengaluru (IN); Vishal Reddy Baddam, Bengaluru (IN); Suman Ghosh, Hennur (IN); and Gururaj Uddihal, Bangalore (IN)
Assigned to Intuit Inc., Mountain View, CA (US)
Filed by Intuit Inc., Mountain View, CA (US)
Filed on Jun. 13, 2024, as Appl. No. 18/742,767.
Int. Cl. G06F 15/16 (2006.01); G06F 16/23 (2019.01); G06F 16/2455 (2019.01)
CPC G06F 16/2456 (2019.01) [G06F 16/2358 (2019.01); G06F 16/24552 (2019.01)]
16 Claims
 
1. A method performed by a computing system configured for batch materialization, the method comprising:
receiving an incremental change data capture (CDC) changeset comprising a plurality of primary keys associated with corresponding data changes comprising at least one of additions, updates, and deletes;
extracting primary keys from the incremental CDC changeset;
adding extracted primary keys to at least one Bloom filter;
broadcasting the at least one Bloom filter to a plurality of executors;
filtering, by each executor, a baseline data table from a data lake based on the extracted primary keys in the broadcast at least one Bloom filter, wherein filtering the baseline data table produces a baseline match dataframe and a baseline unmatched dataframe, wherein all primary keys in the baseline match dataframe match the extracted primary keys from the incremental CDC changeset and wherein all primary keys in the baseline unmatched dataframe do not match the extracted primary keys from the incremental CDC changeset;
providing a different subset of the incremental CDC changeset to each of the plurality of executors;
applying, by each executor, changes in a received subset of the incremental CDC changeset to the baseline match dataframe to produce a baseline changed dataframe;
combining, by each executor, the baseline changed dataframe with the baseline unmatched dataframe to produce a final changed baseline data table; and
storing the final changed baseline data table in the data lake.
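
The steps recited in claim 1 can be illustrated with a minimal single-process sketch. It is not the patented implementation. Everything named here is an assumption made for illustration: the BloomFilter class and its sizes, the helper names partition_of, run_executor and materialize, the "id" primary-key column, the tuple layout of a change, and the choice to co-partition baseline rows and changes by a hash of the primary key. Executors are simulated as a loop, and "storing" is reduced to returning the merged rows. Because a Bloom filter can report false positives, the sketch leaves any baseline row that passes the filter but has no change in the executor's subset untouched.

```python
import hashlib
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]
# Hypothetical change layout: (op, primary_key, row); row is None for deletes.
Change = Tuple[str, str, Optional[Row]]


class BloomFilter:
    """Tiny Bloom filter over string keys (illustrative sizes, not tuned)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str) -> Iterable[int]:
        # Double hashing: derive the bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return ((h1 + i * h2) % self.num_bits for i in range(self.num_hashes))

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def partition_of(key: str, num_executors: int) -> int:
    # Stable hash so a key's baseline row and its changes land on the same executor.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_executors


def run_executor(baseline_part: List[Row], changes_part: List[Change],
                 bloom: BloomFilter, pk: str) -> List[Row]:
    # Filter the baseline partition with the broadcast Bloom filter.
    matched = {str(r[pk]): r for r in baseline_part if bloom.might_contain(str(r[pk]))}
    unmatched = [r for r in baseline_part if not bloom.might_contain(str(r[pk]))]
    # Apply this executor's subset of the changeset to the matched rows.
    for op, key, row in changes_part:
        if op == "delete":
            matched.pop(key, None)
        elif row is not None:  # "add" or "update"
            matched[key] = row
    # Combine changed rows with the untouched rows.
    return list(matched.values()) + unmatched


def materialize(baseline: List[Row], changeset: List[Change],
                num_executors: int = 3, pk: str = "id") -> List[Row]:
    # Extract primary keys from the changeset and add them to the Bloom filter.
    bloom = BloomFilter()
    for _, key, _ in changeset:
        bloom.add(key)
    # "Broadcast" is simulated by passing the same filter object to every executor.
    base_parts: List[List[Row]] = [[] for _ in range(num_executors)]
    change_parts: List[List[Change]] = [[] for _ in range(num_executors)]
    for row in baseline:
        base_parts[partition_of(str(row[pk]), num_executors)].append(row)
    for change in changeset:
        change_parts[partition_of(change[1], num_executors)].append(change)
    result: List[Row] = []
    for b_part, c_part in zip(base_parts, change_parts):
        result.extend(run_executor(b_part, c_part, bloom, pk))
    # Sorted only to make the output deterministic for inspection.
    return sorted(result, key=lambda r: str(r[pk]))
```

For example, calling materialize on a baseline with keys "a", "b" and "c" and a changeset that updates "a", deletes "b" and adds "d" returns rows for "a" (updated), "c" (unchanged) and "d" (new). In a distributed engine the benefit of the filter is that rows failing the membership test can bypass the comparatively expensive merge step entirely.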