US 11,675,517 B2
Multi-pass distributed data shuffle
Mohsen Vakilian, Kirkland, WA (US); and Hossein Ahmadi, Seattle, WA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on Oct. 19, 2022, as Appl. No. 17/969,296.
Application 17/969,296 is a continuation of application No. 17/359,810, filed on Jun. 28, 2021, granted, now 11,513,710.
Application 17/359,810 is a continuation of application No. 16/672,939, filed on Nov. 4, 2019, granted, now 11,061,596, issued on Jul. 13, 2021.
Prior Publication US 2023/0040749 A1, Feb. 9, 2023
Int. Cl. G06F 3/06 (2006.01); G06F 12/02 (2006.01); G06F 12/0804 (2016.01)
CPC G06F 3/0644 (2013.01) [G06F 3/067 (2013.01); G06F 3/0611 (2013.01); G06F 12/0223 (2013.01); G06F 12/0804 (2013.01)] 18 Claims
 
1. A method of repartitioning data in a distributed network, the method comprising:
executing, by one or more first processors, a first shuffle of a first portion of a data set from a plurality of first sources to a plurality of first sinks, each first sink collecting data from one or more of the first sources;
tracking, by the one or more first processors, metadata of the first portion of the data set during the first shuffle;
executing, by one or more second processors separate from the one or more first processors, a second shuffle of a second portion of the data set from the plurality of first sources to a plurality of second sinks, each second sink collecting data from one or more of the first sources; and
tracking, by the one or more second processors, metadata of the second portion of the data set during the second shuffle,
wherein executing the first and second shuffles causes the data set to be repartitioned such that one or more first sinks and one or more second sinks collect data that originated from two or more of the first sources.
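
The two-shuffle repartitioning recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `shuffle`, the `in_portion` predicate, the split of the data set by key range, the `key % num_sinks` routing rule, and the metadata fields are all assumptions made for illustration. The sketch also runs both shuffles sequentially in one process, whereas the claim assigns them to separate processors.

```python
from collections import defaultdict


def shuffle(sources, in_portion, num_sinks):
    """Run one shuffle pass over the records selected by in_portion.

    Returns (sinks, metadata). sinks[i] holds the records collected by
    sink i. metadata maps each non-empty sink to its record count and the
    set of source indices that contributed to it (hypothetical fields).
    """
    sinks = [[] for _ in range(num_sinks)]
    metadata = defaultdict(lambda: {"records": 0, "from_sources": set()})
    for source_id, records in enumerate(sources):
        for key, value in records:
            if not in_portion(key):
                continue  # this record belongs to the other shuffle
            sink_id = key % num_sinks  # assumed routing rule
            sinks[sink_id].append((key, value))
            # Track metadata while the shuffle is in progress.
            metadata[sink_id]["records"] += 1
            metadata[sink_id]["from_sources"].add(source_id)
    return sinks, dict(metadata)


# Three first sources, each holding (key, value) records.
sources = [
    [(0, "a"), (5, "b"), (2, "c")],
    [(1, "d"), (4, "e"), (7, "f")],
    [(6, "g"), (3, "h")],
]

# First shuffle: first portion of the data set (keys below 4) to two first sinks.
first_sinks, first_meta = shuffle(sources, lambda k: k < 4, num_sinks=2)

# Second shuffle: second portion (keys 4 and above) to two second sinks.
second_sinks, second_meta = shuffle(sources, lambda k: k >= 4, num_sinks=2)

print(first_sinks)
print(second_sinks)
```

In this toy data set, at least one first sink and at least one second sink end up holding records from two or more sources, which is the repartitioning condition stated in the final wherein clause.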