US 12,072,863 B1
Data ingestion using data file clustering with KD-epsilon trees
Prakhar Jain, Sunnyvale, CA (US); Frederick Ryan Johnson, Orem, UT (US); and Bart Samwel, Oegstgeest (NL)
Assigned to Databricks, Inc., San Francisco, CA (US)
Filed by Databricks, Inc., San Francisco, CA (US)
Filed on Jul. 5, 2023, as Appl. No. 18/218,400.
Int. Cl. G06F 16/20 (2019.01); G06F 16/22 (2019.01); G06F 16/23 (2019.01); G06F 16/245 (2019.01); G06F 16/28 (2019.01)
CPC G06F 16/2246 (2019.01) [G06F 16/2358 (2019.01); G06F 16/245 (2019.01); G06F 16/285 (2019.01)]
21 Claims
1. A method comprising:
receiving, from a client device, a request to ingest a set of records to a data table stored in a data storage system, the data table including a plurality of records for one or more features;
accessing a data tree for the data table including a plurality of nodes and edges, the nodes of the data tree representing conditions with respect to key-values for two or more keys, a leaf node of the data tree configured as a data file that includes a respective subset of records having key-values satisfying the condition for the node and conditions associated with parent nodes of the node, and a parent node of the data tree configured as a file with a buffer that includes changes to the data table and a storage for pointers to child nodes of the parent node;
determining whether a parent node of the data tree has sufficient data storage in the buffer to store the set of records of the request; and
responsive to determining that the parent node has insufficient data storage to store the set of records, writing at least a portion of the set of records or records in the file associated with the parent node to at least one file associated with a set of child nodes of the parent node.
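
The buffering and flushing steps recited in claim 1 can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the `Node` layout, the `buffer_capacity` field, the single-threshold split condition (`split_key`, `split_value`), and the `ingest` function are all hypothetical names chosen for illustration, and plain Python lists stand in for the files in the data storage system. The sketch assumes each parent node has exactly two children and that a record is a tuple of key-values.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A record is a tuple of key-values for two or more keys (assumed layout).
Record = Tuple[float, ...]


@dataclass
class Node:
    """Hypothetical tree node; lists stand in for files on storage."""
    split_key: Optional[int] = None       # index of the key the condition tests
    split_value: Optional[float] = None   # key-value < split_value goes left
    left: Optional["Node"] = None         # pointers to child nodes
    right: Optional["Node"] = None
    buffer_capacity: int = 4              # assumed buffer size, in records
    buffer: List[Record] = field(default_factory=list)   # parent-node buffer
    records: List[Record] = field(default_factory=list)  # leaf data file

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def ingest(node: Node, new_records: List[Record]) -> None:
    """Buffer records at a parent node, flushing to children when full."""
    if node.is_leaf:
        # A leaf is a data file: append the records directly.
        node.records.extend(new_records)
        return

    # Determine whether the parent's buffer can hold the incoming records.
    if len(node.buffer) + len(new_records) <= node.buffer_capacity:
        node.buffer.extend(new_records)
        return

    # Insufficient space: write buffered and incoming records to the children
    # whose conditions they satisfy, then clear the parent's buffer.
    pending = node.buffer + list(new_records)
    node.buffer = []
    left_batch = [r for r in pending if r[node.split_key] < node.split_value]
    right_batch = [r for r in pending if r[node.split_key] >= node.split_value]
    if left_batch:
        ingest(node.left, left_batch)
    if right_batch:
        ingest(node.right, right_batch)
```

In this sketch the flush moves every pending record down one level, which is only one of the options the claim language allows ("at least a portion of the set of records or records in the file associated with the parent node"). A fuller design might flush only part of the buffer or split leaf files that grow too large.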