US 12,443,599 B1
Storage-side filtering for performing hash joins
Gopi Krishna Attaluri, Cupertino, CA (US); Kamal Kant Gupta, Snoqualmie, WA (US); Saileshwar Krishnamurthy, Palo Alto, CA (US); Yingjie He, Cupertino, CA (US); and Yongsik Yoon, Sammamish, WA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Dec. 13, 2017, as Appl. No. 15/841,110.
Claims priority of provisional application 62/590,225, filed on Nov. 22, 2017.
Int. Cl. G06F 16/2453 (2019.01)
CPC G06F 16/24544 (2019.01) [G06F 16/24537 (2019.01)] 16 Claims
 
1. A system, comprising:
a memory to store program instructions which, if performed by at least one processor, cause the at least one processor to perform a method to at least:
receive, at a query engine, a request to perform a query at a database stored separate from the query engine across a plurality of storage nodes in a distributed data store;
generate, by the query engine, a plan to perform the query, wherein to generate the plan, the program instructions cause the at least one processor to:
during the generation of the query plan:
identify a hash join operation to perform the query;
identify a build table for performing the hash join operation before accessing different tables of the database joined in the query based on respective sizes tracked for the different tables of the database joined in the query that can be determined before accessing the different tables of the database, wherein the identification of the build table determines that the build table is a smallest one of the respective tables;
after generation of the query plan to perform the query, execute, by the query engine, the query plan, wherein to execute the query plan, the query engine is configured to:
send, by the query engine to the plurality of storage nodes, respective requests over a network that instruct individual ones of the plurality of storage nodes associated with the hash join operation to perform a scan of the build table, identified in the query plan before the respective requests are sent, for the hash join according to a join predicate of the hash join operation;
generate, by the query engine, a filter using values of the build table obtained from the individual storage nodes based on the scan of the build table at the individual storage nodes;
send, by the query engine, respective requests and the filter generated by the query engine using the values from the individual storage nodes over the network that instruct the individual storage nodes to individually filter data stored at respective ones of the individual storage nodes according to the filter generated at the query engine as part of performing the hash join operation in the plan and then send filtered values over the network directly to the query engine; and
cause a result of the query to be provided to a user.
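
For illustration only, the flow recited in claim 1 may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim recites only "a filter", so the Bloom filter here is an assumed choice, and the names StorageNode, BloomFilter, hash_join, tracked_sizes, and the sample tables are hypothetical. Network requests are modeled as plain method calls.

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Small Bloom filter (assumed filter type; the claim does not name one)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


@dataclass
class StorageNode:
    """Hypothetical storage node holding a slice of each table."""
    tables: dict = field(default_factory=dict)  # table name -> list of row dicts

    def scan(self, table):
        # Stands in for the request to scan the build table.
        return list(self.tables.get(table, []))

    def filtered_scan(self, table, column, bloom):
        # Storage-side filtering: drop rows whose join key cannot match.
        return [r for r in self.tables.get(table, []) if bloom.might_contain(r[column])]


def hash_join(nodes, tracked_sizes, left, right, column):
    # Planning: pick the smallest table as the build table using tracked
    # sizes only, before any table is accessed.
    build = min((left, right), key=lambda t: tracked_sizes[t])
    probe = right if build == left else left

    # Execution: scan the build table at each storage node.
    build_rows = [row for node in nodes for row in node.scan(build)]

    # Generate the filter (and the join hash table) at the query engine.
    bloom, hash_table = BloomFilter(), {}
    for row in build_rows:
        bloom.add(row[column])
        hash_table.setdefault(row[column], []).append(row)

    # Send the filter to the nodes; each node filters its own data.
    probe_rows = [r for node in nodes for r in node.filtered_scan(probe, column, bloom)]

    # The exact hash-table lookup discards any Bloom false positives.
    return [{**b, **p} for p in probe_rows for b in hash_table.get(p[column], [])]


node_a = StorageNode({
    "customers": [{"cust_id": 1, "name": "Ann"}],
    "orders": [{"cust_id": 1, "item": "pen"}, {"cust_id": 7, "item": "ink"}],
})
node_b = StorageNode({
    "customers": [{"cust_id": 2, "name": "Bo"}],
    "orders": [{"cust_id": 2, "item": "cap"}, {"cust_id": 9, "item": "pad"}],
})
tracked_sizes = {"customers": 2, "orders": 4}  # sizes known at planning time
result = hash_join([node_a, node_b], tracked_sizes, "customers", "orders", "cust_id")
print(result)
```

In this sketch the smaller "customers" table becomes the build table, and orders whose cust_id is absent from it are normally dropped at the storage nodes rather than sent to the query engine. A Bloom filter can pass occasional false positives, which is why the final lookup against the exact hash table is retained.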