US 12,314,240 B1
Incremental partition index update using bloom filters
Daniel Opincariu, Redmond, WA (US); Zhuonan Song, Bellevue, WA (US); Miradham Kamilov, Vancouver (CA); and Baosheng Wang, Vancouver (CA)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 18, 2021, as Appl. No. 17/205,839.
Int. Cl. G06F 16/22 (2019.01); G06F 16/23 (2019.01); G06F 16/2455 (2019.01); G06F 16/2458 (2019.01)
CPC G06F 16/2272 (2019.01) [G06F 16/2358 (2019.01); G06F 16/24554 (2019.01); G06F 16/2462 (2019.01)]
19 Claims
OG exemplary drawing
 
1. A system, comprising:
a data lake comprising a distributed object store; and
a data indexing system comprising one or more processors and one or more memories storing computer-executable instructions that, as a result of execution, cause the one or more processors to:
generate an index for a plurality of fields in respective partitions in the data lake, wherein the index comprises a first plurality of Bloom filters that collectively indicates whether there is a possibility that a value is present in the data lake, wherein each Bloom filter is associated with a respective field of the plurality of fields;
generate, for the index, a second plurality of Bloom filters based on one or more updates to records of the data lake applied after generation of the first plurality of Bloom filters that reflect an updated state of one or more partitions;
store the second plurality of Bloom filters;
receive a query indicating the value;
determine, using at least a portion of the first plurality of Bloom filters and the second plurality of Bloom filters, a set of candidate partitions where the value is possibly stored and a set of non-candidate partitions where the value is definitely not stored, wherein determining the set of candidate partitions includes,
determine, based on an updated partition list, whether a partition of the one or more partitions was subject to the one or more updates,
based on the updated partition list indicating that the partition was subject to the one or more updates, determine that the value is possibly stored in the partition based on a Bloom filter of the second plurality of Bloom filters associated with the partition;
determine, based on the updated partition list, whether a second partition of the one or more partitions was subject to the one or more updates, and
based on the updated partition list indicating that the second partition was not subject to the one or more updates, determine that the value is possibly stored in the second partition based on a Bloom filter of the first plurality of Bloom filters associated with the second partition;
determine, using the set of candidate partitions, one or more records that comprise the value in one or more partitions of the set of candidate partitions; and
fulfill the query by providing the one or more records.
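
The partition-selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify hash functions, filter sizes, storage layout, or how filters are keyed, so everything here is an assumption made for illustration. That includes the `BloomFilter` class, the `(partition, field)` keying, the in-memory `lake` dictionary standing in for the distributed object store, and the helper names `build_filters`, `candidate_partitions`, and `run_query`. The sketch shows only the routing idea: a partition on the updated partition list is checked against the second (newer) set of Bloom filters, and any other partition is checked against the first set.

```python
import hashlib
from typing import Dict, Iterable, List, Optional, Set, Tuple

Key = Tuple[str, str]  # hypothetical keying: (partition id, field name)


class BloomFilter:
    """Tiny illustrative Bloom filter; the bit array is held in a Python int."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value: str) -> Iterable[int]:
        # Assumption: seeded SHA-256 stands in for k independent hash functions.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filters(partitions: Dict[str, List[dict]],
                  only: Optional[Set[str]] = None) -> Dict[Key, BloomFilter]:
    """Build one filter per (partition, field); `only` restricts the partitions."""
    filters: Dict[Key, BloomFilter] = {}
    for pid, records in partitions.items():
        if only is not None and pid not in only:
            continue
        for record in records:
            for field, value in record.items():
                filters.setdefault((pid, field), BloomFilter()).add(str(value))
    return filters


def candidate_partitions(field: str, value: str, partition_ids: Iterable[str],
                         base: Dict[Key, BloomFilter],
                         delta: Dict[Key, BloomFilter],
                         updated: Set[str]) -> Tuple[List[str], List[str]]:
    """Split partitions into candidates and non-candidates for `value`."""
    candidates, non_candidates = [], []
    for pid in partition_ids:
        # Updated partitions consult the second set of filters, others the first.
        source = delta if pid in updated else base
        bloom = source.get((pid, field))
        if bloom is not None and bloom.might_contain(value):
            candidates.append(pid)
        else:
            non_candidates.append(pid)
    return candidates, non_candidates


def run_query(field: str, value: str, partitions: Dict[str, List[dict]],
              base: Dict[Key, BloomFilter], delta: Dict[Key, BloomFilter],
              updated: Set[str]) -> List[dict]:
    """Scan only candidate partitions and return the records that match."""
    candidates, _ = candidate_partitions(field, value, partitions, base, delta, updated)
    return [r for pid in candidates for r in partitions[pid]
            if str(r.get(field)) == value]


# Hypothetical data: two partitions, indexed once, then one partition is updated.
lake = {
    "p0": [{"user": "alice", "city": "Seattle"}],
    "p1": [{"user": "bob", "city": "Vancouver"}],
}
base = build_filters(lake)                      # first plurality of Bloom filters
lake["p1"].append({"user": "carol", "city": "Redmond"})
updated = {"p1"}                                # updated partition list
delta = build_filters(lake, only=updated)       # second plurality of Bloom filters
result = run_query("user", "carol", lake, base, delta, updated)
```

In this sketch the first set of filters is never rewritten after the update. Only the filters for partitions on the updated list are rebuilt, which is one plausible reading of the "incremental" update in the title. Because Bloom filters can return false positives, `run_query` still verifies each record in a candidate partition before returning it.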