US 12,216,633 B1
Memory-aware system and method for identifying matching portions of two sets of data in a multiprocessor system
Thomas Kejser, Tallinn (EE); and Charles E. Gotlieb, San Juan, PR (US)
Assigned to Yellowbrick Data, Inc., Mountain View, CA (US)
Filed by Yellowbrick Data, Inc., Palo Alto, CA (US)
Filed on Sep. 20, 2021, as Appl. No. 17/480,131.
Application 17/480,131 is a continuation of application No. 16/890,799, filed on Jun. 2, 2020, granted, now 11,126,607.
Application 16/890,799 is a continuation of application No. 15/340,952, filed on Nov. 1, 2016, granted, now 10,671,608.
Claims priority of provisional application 62/249,268, filed on Nov. 1, 2015.
Claims priority of provisional application 62/249,265, filed on Nov. 1, 2015.
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/00 (2019.01); G06F 16/18 (2019.01); G06F 16/215 (2019.01); G06F 16/22 (2019.01); G06F 16/2453 (2019.01); G06F 16/2455 (2019.01); G06F 16/27 (2019.01); G06F 13/16 (2006.01)
CPC G06F 16/2255 (2019.01) [G06F 16/1858 (2019.01); G06F 16/215 (2019.01); G06F 16/2282 (2019.01); G06F 16/24544 (2019.01); G06F 16/2456 (2019.01); G06F 16/278 (2019.01); G06F 13/1673 (2013.01); G06F 2209/5018 (2013.01)]
18 Claims
1. A method of joining a first database data set and a second database data set, the method comprising:
(A) identifying a size of a storage space to be used for joining the first database data set and the second database data set;
(B) identifying a number of a plurality of processor cores to be used for joining the first database data set and the second database data set;
(C) hashing each of a plurality of data elements of the first database data set to produce a first hash result for each of the plurality of data elements, each first hash result comprising a first portion and a second portion, the first and second portions each comprising less than all of the first hash result and not entirely overlapping with each other;
(D1) assigning each of the plurality of data elements of the first database data set to one of a plurality of buffers, responsive to the first portion of the first hash result for each of the respective data elements in the plurality of data elements;
(D2) identifying a size of the plurality of data elements in the first database data set;
(D3) comparing the size of the plurality of data elements in the first database data set plus an additional amount with the size of the storage space;
(E) identifying a number of a plurality of sub buffers responsive to the size of the storage space identified, the number of the plurality of processor cores identified, and a size to be used as a size for each of the plurality of the sub buffers, each said sub buffer corresponding to a range of potential first hash results, a plurality of said sub buffers corresponding to each buffer in the plurality of buffers if the size of the storage space is less than the size of the plurality of data elements in the first database data set plus the additional amount, one of the plurality of sub buffers corresponding to each of the plurality of buffers if the size of the storage space is not less than the size of the plurality of data elements in the first database data set plus the additional amount;
(F) by each of the plurality of processor cores, with other processor cores in the plurality of processor cores:
(i) selecting a buffer in the plurality of buffers not already selected by any of the plurality of processor cores;
(ii) assigning each of the plurality of data elements assigned to the selected buffer, to one of the sub buffers in the plurality of sub buffers, responsive to the second portion of the first hash result of each said data element and the range of potential first hash results of said one of the sub buffers;
(iii) generating a hash table for each of the sub buffers in the plurality of sub buffers comprising a first alternate hash result for each of the data elements that is generated using, and different from, the first hash result for said data element;
(iv) storing each sub buffer corresponding to the selected buffer and the hash table of said sub buffer; and
(v) repeating steps F (i)-(iv) until all buffers in the plurality of buffers have been selected;
(G) receiving a portion, less than all, of a plurality of data elements of the second database data set into a plurality of chunks of memory;
(H) by each of the plurality of processor cores, with other processor cores in the plurality of processor cores:
(i) selecting one of the plurality of chunks not already selected by any of the plurality of processor cores; and
(ii) for each of the plurality of data elements in the selected chunk:
(a) hashing said data element in the selected chunk to produce a second hash result for said data element;
(b) assigning said data element in the selected chunk to one of a plurality of sub partitions, each of the sub partitions in the plurality of sub partitions being assigned a range of potential second hash results equal to a range of a different one of the sub buffers, said assigning being responsive to the range of potential second hash results of said sub partition and the second hash result of said data element in the selected chunk; and
(iii) repeating steps (i) and (ii) until all of the chunks have been processed;
(I) by each of the plurality of processor cores, with other processor cores in the plurality of processor cores:
(i) selecting one of the plurality of sub partitions not already selected by any of the plurality of processor cores;
(ii) reading the hash table and data elements of the first database data set of any of the sub buffers having one of the ranges of potential first hash results corresponding to the range of potential second hash results of the selected sub partition;
(iii) for each of the plurality of data elements in the selected sub partition:
(a) identifying whether a second alternate hash result of said data element corresponds to any first alternate hash result of said hash table read; and
(b) if the second alternate hash result corresponds to the first alternate hash result of said hash table read, comparing said data element in the selected sub partition with a data element in the sub buffer read that corresponds to the corresponding first alternate hash result, and if the comparing results in a match, identifying as matched with said data element in the selected sub partition, the data element in the sub buffer read that corresponds to said data element in the selected sub partition; and
(iv) repeating steps (i)-(iii) until all of the sub partitions have been selected; and
(J) repeating steps G-I until all of the plurality of data elements of the second database data set have been processed as in steps G-I.
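
The steps of claim 1 can be illustrated with a minimal, single-threaded Python sketch. It is not the patented implementation. Every name in it (hash64, split_hash, pick_sub_bits, hash_join) is hypothetical, as are the 64-bit blake2b hash, the choice of bit fields for the first portion, second portion, and alternate hash result, and the sizing heuristic in pick_sub_bits; the claim states only which quantities the number of sub buffers is responsive to, not a formula. The loops over buffers, chunks, and sub partitions mark the work that steps (F), (H), and (I) distribute across processor cores; here they simply run in sequence, and an in-memory dictionary stands in for the storing of step (F)(iv).

```python
import hashlib
from collections import defaultdict

HASH_BITS = 64  # assumed hash width; the claim does not fix one


def hash64(key):
    """Hash a join key to 64 bits (blake2b is an illustrative choice)."""
    digest = hashlib.blake2b(repr(key).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def split_hash(h, buffer_bits, sub_bits):
    """Return (first portion, second portion, alternate hash) of one hash result."""
    first = h >> (HASH_BITS - buffer_bits)            # top bits select the buffer
    second = (h >> (HASH_BITS - buffer_bits - sub_bits)) & ((1 << sub_bits) - 1)
    alternate = h & 0xFFFFFFFF                        # low 32 bits, derived from h
    return first, second, alternate


def pick_sub_bits(build_bytes, extra_bytes, storage_bytes, cores,
                  sub_buffer_bytes, buffer_bits):
    """Steps D2-E: bits used to pick a sub buffer within a buffer (0 = one each).

    The sizing rule below is an assumed heuristic, not taken from the claim.
    """
    if storage_bytes >= build_bytes + extra_bytes:
        return 0                                      # one sub buffer per buffer
    target = max(1, min(sub_buffer_bytes, storage_bytes // cores))
    wanted = -(-build_bytes // target)                # ceiling division
    per_buffer = max(2, -(-wanted // (1 << buffer_bits)))
    return (per_buffer - 1).bit_length()              # round up to a power of two


def hash_join(build_rows, probe_batches, key=lambda row: row[0],
              buffer_bits=3, sub_bits=0):
    """Join build_rows with probe_batches (a list of batches, each a list of chunks)."""
    # Steps C and D1: hash each build row and assign it to a buffer.
    buffers = defaultdict(list)
    for row in build_rows:
        h = hash64(key(row))
        buffers[h >> (HASH_BITS - buffer_bits)].append((h, row))

    # Step F: per buffer, split into sub buffers and build one hash table each.
    stored = {}
    for first, items in buffers.items():
        subs = defaultdict(list)
        for h, row in items:
            _, second, alt = split_hash(h, buffer_bits, sub_bits)
            subs[second].append((alt, row))
        for second, entries in subs.items():
            table = defaultdict(list)                 # alternate hash -> row indexes
            for index, (alt, _) in enumerate(entries):
                table[alt].append(index)
            stored[(first, second)] = ([row for _, row in entries], table)

    matches = []
    for chunks in probe_batches:                      # steps G and J: one portion at a time
        # Step H: route each probe row to the sub partition with the same hash range.
        partitions = defaultdict(list)
        for chunk in chunks:
            for row in chunk:
                first, second, alt = split_hash(hash64(key(row)), buffer_bits, sub_bits)
                partitions[(first, second)].append((alt, row))
        # Step I: probe the matching sub buffer's hash table, then compare keys.
        for part_id, entries in partitions.items():
            if part_id not in stored:
                continue
            rows, table = stored[part_id]
            for alt, probe_row in entries:
                for index in table.get(alt, ()):
                    if key(rows[index]) == key(probe_row):
                        matches.append((rows[index], probe_row))
    return matches
```

In this sketch a caller would first compute sub_bits with pick_sub_bits and then pass it to hash_join. Comparing the full keys after an alternate-hash hit mirrors step (I)(iii)(b), which guards against two different keys sharing the same alternate hash result.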