CPC G06F 12/1081 (2013.01) [G06F 9/544 (2013.01); G06F 2212/65 (2013.01)]    20 Claims

1. A coarse-grained reconfigurable processor system for implementing data-parallel training of a neural network, comprising:
a first memory;
a set of coarse-grained reconfigurable units (CGRUs) in a first coarse-grained reconfigurable processor that is coupled to the first memory and configured to implement at least a portion of the neural network, to determine first and second gradients, respectively, of first and second model parameters based on a batch of training data, and to store the first and second gradients in the first memory;
a network interface including an external direct memory access (DMA) engine coupled between the first memory and a network; and
a work queue associated with the external DMA engine, wherein completion of determining the first gradient triggers a first work queue entry of the work queue that directs the external DMA engine to transfer the first gradient for a gradient reduction operation from the first memory over the network to a second memory that is coupled to a second coarse-grained reconfigurable processor.
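The control flow claimed above — gradient completion triggering a work queue entry that directs an external DMA engine to move that gradient to a second processor's memory — can be illustrated with a minimal simulation sketch. All class and variable names here (`ExternalDMAEngine`, `WorkQueue`, `first_memory`, etc.) are hypothetical and chosen for illustration only; they do not correspond to any real CGRA runtime or DMA API, and the dictionaries merely stand in for the first and second memories.

```python
import queue

class ExternalDMAEngine:
    """Hypothetical model of a DMA engine that copies gradient buffers
    from a local memory to a remote memory over a network."""
    def __init__(self, remote_memory):
        self.remote_memory = remote_memory

    def transfer(self, local_memory, key):
        # Stand-in for a network transfer of one named gradient buffer.
        self.remote_memory[key] = local_memory[key]

class WorkQueue:
    """Hypothetical work queue associated with the DMA engine: each entry
    directs the engine to transfer one gradient."""
    def __init__(self, dma, local_memory):
        self.entries = queue.Queue()
        self.dma = dma
        self.local_memory = local_memory

    def post(self, key):
        # Posting an entry models the trigger that fires when a
        # gradient finishes being computed and stored.
        self.entries.put(key)

    def drain(self):
        # The DMA engine services entries in order.
        while not self.entries.empty():
            self.dma.transfer(self.local_memory, self.entries.get())

# First processor's memory holds gradients; second processor's memory
# receives them for the gradient reduction operation.
first_memory = {}
second_memory = {}
dma = ExternalDMAEngine(second_memory)
wq = WorkQueue(dma, first_memory)

# Completion of the first gradient triggers a work queue entry, so its
# transfer can overlap with computation of the second gradient.
first_memory["grad_w1"] = [0.1, -0.2]   # first gradient determined
wq.post("grad_w1")
first_memory["grad_w2"] = [0.05]        # second gradient determined
wq.post("grad_w2")

wq.drain()
print(second_memory)
```

The point of the sketch is the decoupling: the compute units only post entries, while the DMA engine independently drains the queue, which is what lets gradient transfer overlap with ongoing gradient computation in a data-parallel training step.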