US 12,306,777 B1
Hierarchical collective compute operations using DMA transfers
Yongseok Koh, San Jose, CA (US); Se Wang Oh, Campbell, CA (US); and Ron Diamant, Santa Clara, CA (US)
Assigned to Amazon Technologies, Inc., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Mar. 30, 2023, as Appl. No. 18/193,291.
Int. Cl. G06F 13/28 (2006.01); G06N 3/091 (2023.01)
CPC G06F 13/28 (2013.01) [G06N 3/091 (2023.01)] 22 Claims
OG exemplary drawing
 
1. A method for performing distributed training of a neural network model having a plurality of model partitions in a compute system having a plurality of processing nodes, each processing node comprising a plurality of processing ranks, and each processing rank having a rank identifier (ID), the method comprising:
providing respective training data to each of the processing ranks;
providing respective model partitions to each of the processing ranks;
performing a first hierarchical all-gather operation to provide each of the processing ranks with weights of each of the model partitions to execute a forward pass of the neural network model;
performing a second hierarchical all-gather operation to provide each of the processing ranks with the weights of each of the model partitions to execute a backward pass of the neural network model; and
performing a hierarchical reduce-scatter operation to provide each of the processing ranks with respective sets of gradients for the model partition of the corresponding processing rank to update the weights of the model partition,
wherein each of the first and the second hierarchical all-gather operations includes all-gather direct memory access (DMA) transfers to perform an inter-node all-gather operation of non-contiguous memory regions across the processing nodes, and an intra-node all-gather operation of non-contiguous memory regions within each of the processing nodes.
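
The wherein clause above splits each all-gather into an inter-node phase over ranks that share a local index and an intra-node phase within each node. The Python sketch below illustrates one possible ordering of that two-phase data movement on a single process; all names and sizes in it (NUM_NODES, RANKS_PER_NODE, node_of, local_of, the shards dictionary) are illustrative assumptions rather than identifiers from the patent, and a real system would carry out both phases as DMA transfers between accelerator memories instead of dictionary copies.

# Minimal single-process sketch of a two-phase (hierarchical) all-gather.
# NUM_NODES, RANKS_PER_NODE, and the shard bookkeeping below are illustrative
# assumptions, not values or identifiers taken from the patent.

NUM_NODES = 2        # processing nodes in the compute system
RANKS_PER_NODE = 4   # processing ranks per node
WORLD = NUM_NODES * RANKS_PER_NODE

def node_of(rank):   # node index of a global rank (node-major layout assumed)
    return rank // RANKS_PER_NODE

def local_of(rank):  # local index of a rank within its node
    return rank % RANKS_PER_NODE

# Each processing rank initially holds only the weight shard of its own
# model partition, keyed here by global rank ID.
shards = {rank: {rank: f"weights[{rank}]"} for rank in range(WORLD)}

# Phase 1 -- inter-node all-gather: each rank exchanges shards with the ranks
# that share its local index on the other nodes.  The received shards occupy
# non-contiguous (strided) slots of the full weight buffer.
initial = {rank: dict(held) for rank, held in shards.items()}
for rank in range(WORLD):
    for peer_node in range(NUM_NODES):
        peer = peer_node * RANKS_PER_NODE + local_of(rank)
        shards[rank].update(initial[peer])

# Phase 2 -- intra-node all-gather: within each node, ranks exchange the
# strided subsets gathered in phase 1, so every rank ends with all shards.
after_phase1 = {rank: dict(held) for rank, held in shards.items()}
for rank in range(WORLD):
    for peer_local in range(RANKS_PER_NODE):
        peer = node_of(rank) * RANKS_PER_NODE + peer_local
        shards[rank].update(after_phase1[peer])

assert all(len(held) == WORLD for held in shards.values())
print(f"each of the {WORLD} ranks now holds all {WORLD} weight shards")

The hierarchical reduce-scatter recited for the gradients would run analogous intra-node and inter-node phases with a sum reduction, leaving each rank holding only the reduced gradient shard for its own model partition.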