CPC G06N 20/20 (2019.01) [G06N 3/02 (2013.01); H04L 41/16 (2013.01)]; 14 Claims
1. A computer-implemented method, comprising:
performing distributed deep learning training on a batch of training data;
determining a training time representing an amount of time between:
a beginning batch time for a learner; and
an end batch time for the learner;
determining that the learner is a communication straggler by determining that the training time exceeds a predetermined threshold time; and
modifying a communication aspect of the learner to reduce a future network communication time for the communication straggler to send a future result of the distributed deep learning training on a new batch of training data in response to a centralized parameter server determining that the learner is the communication straggler,
wherein modifying the communication aspect comprises compressing the future result before sending the future result to the centralized parameter server,
wherein the future result is compressed using a compression rate based on a network communication time of the communication straggler.
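The sketch below illustrates the flow the claim recites: each learner times a training batch between a beginning batch time and an end batch time, the centralized parameter server flags the learner as a communication straggler when that training time exceeds a predetermined threshold, and a flagged learner compresses future results at a rate derived from its network communication time before sending them to the server. All identifiers (Learner, ParameterServer, THRESHOLD_SECONDS) and the specific zlib-level heuristic are illustrative assumptions, not taken from the patent text.

```python
# Minimal, hypothetical sketch of the claimed straggler mitigation;
# names and the compression heuristic are assumptions for illustration.
import time
import zlib
import pickle

THRESHOLD_SECONDS = 5.0  # the "predetermined threshold time"


class ParameterServer:
    """Centralized parameter server that flags communication stragglers."""

    def __init__(self):
        self.compression_levels = {}  # learner_id -> zlib level (0 = off)

    def report_batch(self, learner_id, training_time, comm_time):
        # Claim element: the learner is a communication straggler when
        # its per-batch training time exceeds the threshold.
        if training_time > THRESHOLD_SECONDS:
            # Compression rate based on the straggler's network
            # communication time: longer comm time -> heavier compression
            # (assumed mapping, capped at zlib's maximum level 9).
            self.compression_levels[learner_id] = min(9, max(1, round(comm_time)))

    def receive(self, learner_id, payload):
        pass  # gradient aggregation would happen here


class Learner:
    def __init__(self, learner_id, server):
        self.learner_id = learner_id
        self.server = server

    def run_batch(self, batch):
        begin = time.monotonic()        # beginning batch time
        result = self._train(batch)     # local deep-learning step
        payload = pickle.dumps(result)
        send_start = time.monotonic()
        level = self.server.compression_levels.get(self.learner_id, 0)
        if level:
            # Straggler path: compress the future result before
            # sending it to the centralized parameter server.
            payload = zlib.compress(payload, level)
        self.server.receive(self.learner_id, payload)
        end = time.monotonic()          # end batch time
        training_time = end - begin
        comm_time = end - send_start    # network communication time
        self.server.report_batch(self.learner_id, training_time, comm_time)

    def _train(self, batch):
        return [0.0] * len(batch)  # stand-in for a real gradient step
```

Under these assumptions the server adapts per learner: a learner that repeatedly overruns the threshold sends smaller payloads on subsequent batches, which shortens its network communication time without changing the training computation itself.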