| CPC G06F 16/2456 (2019.01) [G06F 16/2358 (2019.01); G06F 16/2386 (2019.01)] | 18 Claims |

|
1. A method performed by a computing system configured for batch materialization, the method comprising:
receiving an incremental change data capture (CDC) changeset comprising a plurality of primary keys associated with full row changes comprising at least one of additions, updates, and deletes in the incremental CDC changeset, wherein the full row changes comprise values for changed data and unchanged data;
extracting primary keys from the incremental CDC changeset;
adding the extracted primary keys to at least one Bloom filter;
broadcasting the at least one Bloom filter to a plurality of executors;
filtering, by each executor, a baseline data table from a data lake based on the extracted primary keys in the broadcast at least one Bloom filter, wherein filtering the baseline data table produces a baseline match dataframe and a baseline unmatched dataframe, wherein all primary keys in the baseline match dataframe match the extracted primary keys from the incremental CDC changeset and wherein all primary keys in the baseline unmatched dataframe do not match the extracted primary keys from the incremental CDC changeset;
providing a different subset of the incremental CDC changeset to each of the plurality of executors;
applying, by each executor, the full row changes in the incremental CDC changeset to the baseline unmatched dataframe to produce a final changed baseline data table; and
storing the final changed baseline data table in the data lake.
|
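The steps recited in claim 1 can be illustrated with a minimal single-process sketch. It is not the claimed implementation: the `BloomFilter` class, the `materialize` function, the `"id"` key column, and the `(op, row)` changeset layout are all hypothetical names chosen for illustration, and the broadcast to and partitioning across executors is collapsed into one loop. The sketch also assumes at most one change per primary key, and it adds an exact key check after the Bloom filter lookup, since a Bloom filter can report false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: num_hashes positions per key in a num_bits array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def materialize(baseline, changeset):
    """Apply full-row CDC changes to a baseline table.

    baseline:  list of row dicts, each with an "id" primary key (assumed name).
    changeset: list of (op, row) pairs, op in {"add", "update", "delete"};
               assumed to hold at most one change per primary key.
    """
    # Extract the primary keys and load them into the Bloom filter.
    changed_keys = {row["id"] for _, row in changeset}
    bloom = BloomFilter()
    for key in changed_keys:
        bloom.add(key)

    # Split the baseline. The Bloom filter is a cheap pre-check; the exact
    # set lookup removes any false positives it lets through.
    matched, unmatched = [], []
    for row in baseline:
        if bloom.might_contain(row["id"]) and row["id"] in changed_keys:
            matched.append(row)
        else:
            unmatched.append(row)

    # Changes carry full rows, so adds and updates replace matched rows
    # outright; deleted rows are simply not carried forward.
    replacements = [row for op, row in changeset if op in ("add", "update")]
    final = sorted(unmatched + replacements, key=lambda r: r["id"])
    return matched, unmatched, final


baseline = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
changeset = [
    ("update", {"id": 2, "name": "B"}),
    ("delete", {"id": 3, "name": "c"}),
    ("add", {"id": 4, "name": "d"}),
]
matched, unmatched, final = materialize(baseline, changeset)
```

Because each change carries the complete row, the matched baseline rows never need to be merged column by column: the unmatched rows pass through untouched, and the changeset supplies every other row in full. In a distributed engine the Bloom filter would typically be built once and shipped to each worker so that the large baseline table can be filtered without shuffling it.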