US 11,789,902 B2
Incrementally improving clustering of cross partition data in a distributed data system
Babatunde Micheal Okutubo, Bellevue, WA (US); Maninderjit Singh Parmar, Redmond, WA (US); and Edgars Sedols, Bellevue, WA (US)
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Nov. 23, 2022, as Appl. No. 18/058,331.
Application 18/058,331 is a continuation of application No. 16/881,379, filed on May 22, 2020, granted, now 11,537,557.
Prior Publication US 2023/0100025 A1, Mar. 30, 2023
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 16/00 (2019.01); G06F 16/13 (2019.01); G06F 16/27 (2019.01); G06F 16/28 (2019.01)
CPC G06F 16/13 (2019.01) [G06F 16/27 (2019.01); G06F 16/285 (2019.01)]
20 Claims
1. A system for improved access to rows of data, each data row associated with a partition of a plurality of partitions, the data rows distributed in one or more files, wherein a file including data rows associated with different partitions of the plurality of partitions is an impure file, the system comprising:
at least one processor;
a first compute node that includes first program code structured to cause the at least one processor to:
select a subset of impure files from a plurality of impure files; and
schedule a clustering task in a queue for sorting data rows of the subset of impure files; and
a second compute node that includes second program code structured to cause the at least one processor to, independent of the first compute node:
retrieve the clustering task from the queue;
execute the sorting of the data rows of the subset of the impure files according to a respective associated partition of each of the data rows;
generate a set of disjoint partition range files based on the sorting; and
transfer each file of the disjoint partition range files to a respective target pure partition.
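The division of work recited in claim 1 can be illustrated with a minimal single-process sketch. It is not the patented implementation; it only assumes, for illustration, that a data row is a (partition id, payload) pair, that a file is a list of rows, that an in-memory queue stands in for the task queue shared by the two compute nodes, and that each disjoint partition range file covers exactly one partition. All names (ClusteringTask, is_impure, schedule_clustering, run_clustering) are hypothetical.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Dict, List, Tuple

# Assumption: a row is (partition_id, payload) and a file is a list of rows.
Row = Tuple[int, str]
File = List[Row]


@dataclass
class ClusteringTask:
    """Hypothetical task record: the impure files to be sorted together."""
    files: List[File]


def is_impure(file: File) -> bool:
    """A file is impure when its rows belong to more than one partition."""
    return len({pid for pid, _ in file}) > 1


def schedule_clustering(files: List[File], queue: Queue, batch_size: int) -> List[File]:
    """First compute node: select a subset of impure files and enqueue a task."""
    subset = [f for f in files if is_impure(f)][:batch_size]
    if subset:
        queue.put(ClusteringTask(subset))
    return subset


def run_clustering(queue: Queue, pure_partitions: Dict[int, List[File]]) -> Dict[int, File]:
    """Second compute node: dequeue a task, sort its rows by partition,
    build one file per partition, and transfer each file to its target."""
    task = queue.get()
    # sorted() is stable, so rows keep their relative order within a partition.
    rows = sorted((row for f in task.files for row in f), key=lambda r: r[0])
    range_files: Dict[int, File] = {}
    for pid, payload in rows:
        range_files.setdefault(pid, []).append((pid, payload))
    # "Transfer" is modeled as appending the file to the target partition's file list.
    for pid, file in range_files.items():
        pure_partitions.setdefault(pid, []).append(file)
    return range_files


# Example: two impure files and one pure file.
files = [
    [(1, "a"), (2, "b")],  # impure: partitions 1 and 2
    [(2, "c"), (1, "d")],  # impure: partitions 2 and 1
    [(3, "e")],            # pure: partition 3 only, never selected
]
queue: Queue = Queue()
selected = schedule_clustering(files, queue, batch_size=2)
targets: Dict[int, List[File]] = {}
produced = run_clustering(queue, targets)
```

In this sketch the scheduler and the worker share only the queue, which mirrors the claim's requirement that the second compute node act independently of the first. Limiting each task to batch_size files reflects the incremental character named in the title: each pass cleans up only a bounded subset of the impure files.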