US 11,886,969 B2
Dynamic network bandwidth in distributed deep learning training
Wei Zhang, Elmsford, NY (US); Xiaodong Cui, Chappaqua, NY (US); Abdullah Kayi, Westchester, NY (US); and Alper Buyuktosunoglu, White Plains, NY (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Jul. 9, 2020, as Appl. No. 16/925,192.
Prior Publication US 2022/0012642 A1, Jan. 13, 2022
Int. Cl. G06N 20/20 (2019.01); H04L 12/24 (2006.01); G06N 3/02 (2006.01); H04L 41/16 (2022.01)
CPC G06N 20/20 (2019.01) [G06N 3/02 (2013.01); H04L 41/16 (2013.01)] 14 Claims
OG exemplary drawing
 
1. A computer-implemented method, comprising:
performing distributed deep learning training on a batch of training data;
determining a training time representing an amount of time between:
a beginning batch time for a learner; and
an end batch time for the learner;
determining that the learner is a communication straggler by determining that the training time exceeds a predetermined threshold time; and
modifying a communication aspect of the learner to reduce a future network communication time for the communication straggler to send a future result of the distributed deep learning training on a new batch of training data in response to a centralized parameter server determining that the learner is the communication straggler,
wherein modifying the communication aspect comprises compressing the future result before sending the future result to the centralized parameter server,
wherein the future result is compressed using a compression rate based on a network communication time of the communication straggler.
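
The exemplary claim describes detecting a communication straggler from its per-batch training time and then compressing that learner's next result at a rate tied to its measured network communication time. The following is a minimal sketch of that idea, not the patented implementation; all names (Learner, ParameterServer, THRESHOLD_SECONDS, BASELINE_COMM_SECONDS) and the top-k sparsification used as a stand-in for the compression step are assumptions introduced only for illustration.

    import numpy as np

    # Hypothetical constants; the patent only calls for a predetermined
    # threshold time and a rate based on the straggler's communication time.
    THRESHOLD_SECONDS = 2.0        # predetermined threshold time per batch
    BASELINE_COMM_SECONDS = 0.5    # nominal network communication time

    class Learner:
        def __init__(self, learner_id):
            self.learner_id = learner_id
            self.compression_rate = 1.0   # 1.0 = send the full result

        def train_batch(self, batch):
            # Stand-in for the forward/backward pass that produces a result
            # (e.g., a gradient) for the parameter server.
            return np.random.randn(1024).astype(np.float32)

        def send_result(self, gradient):
            # Compress the future result before sending it to the
            # centralized parameter server when a rate < 1.0 was assigned.
            if self.compression_rate < 1.0:
                k = max(1, int(gradient.size * self.compression_rate))
                top_k = np.argsort(np.abs(gradient))[-k:]   # keep top-k entries
                return (top_k, gradient[top_k])
            return gradient

    class ParameterServer:
        def check_straggler(self, learner, begin_batch_time, end_batch_time,
                            comm_time):
            # Training time is the span between the beginning batch time and
            # the end batch time for the learner.
            training_time = end_batch_time - begin_batch_time
            if training_time > THRESHOLD_SECONDS:
                # The learner is a communication straggler: assign a
                # compression rate based on its measured network
                # communication time (comm_time > 0 assumed), so a slower
                # link sends a proportionally smaller payload next batch.
                learner.compression_rate = min(
                    1.0, BASELINE_COMM_SECONDS / comm_time)

In this sketch the compression rate shrinks as the straggler's measured communication time grows, which is one straightforward way to reduce the future network communication time that the claim targets.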