| CPC G06N 20/00 (2019.01) [G06F 16/182 (2019.01); G06N 5/04 (2013.01)] | 7 Claims |

1. A distributed training method based on a distributed training system, wherein
the distributed training system is configured to perform model training according to training data, and the distributed training system comprises: a task information server, data servers, and one or more computing servers, wherein the number of the data servers is more than one, and the number of the computing servers is variable;
the distributed training method comprises:
sending, by the task information server, a first training request and information of an available first computing server to at least a first data server among a plurality of data servers;
sending, by the first data server, a first batch of training data to the first computing server, according to the first training request;
performing, by the first computing server, model training according to the first batch of training data, sending, after the training is completed, the trained model parameters to the first data server to be stored, and sending identification information of the first batch of training data to the task information server to be recorded;
recording, by the task information server, a training progress, so as to assign training tasks to each of the computing servers in the system;
wherein each of the data servers comprises a parameter server; and
after the first computing server sends the trained model parameters to the first data server, the method further comprises: storing the trained model parameters in a first parameter server in the first data server;
wherein the model parameters are not stored at any one of the computing servers;
the distributed training method further comprises: downloading, by each of the data servers, training data and information of a model to be trained from a distributed file system, before the training is started;
wherein the distributed training method further comprises:
performing, by the task information server, a survival detection on each of the computing servers in the system, and if the number of available computing servers in the system remains unchanged, enabling the parameter server in each of the data servers to save the latest model parameters, and if the number of available computing servers in the system changes, updating a list of available computing servers and enabling the parameter server in each of the data servers to reload the model parameters saved at the last survival detection;
suspending, by the system, the training when the task information server performs the survival detection; and
sending, by the task information server, a new training request to each of the data servers according to the current model parameters and the recorded identification information of training data that has completed the training, after the survival detection is completed.
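For illustration only: a minimal, single-process Python sketch of the message flow recited in the first four steps of claim 1 (training request, batch dispatch, training, parameter storage, progress recording). The class and method names (TaskInfoServer, DataServer, ComputingServer, run_step) and the placeholder "training" arithmetic are hypothetical, since the claim specifies behavior rather than an API. Note that the model parameters live only in the data server's parameter server, matching the limitation that parameters are not stored at any one of the computing servers.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    batch_id: int   # identification information of the training data
    samples: list   # the training data itself


class DataServer:
    """Holds training data plus a parameter server for trained parameters."""

    def __init__(self, batches, initial_params):
        self.batches = list(batches)
        # Model parameters live here, never on a computing server.
        self.parameter_server = dict(initial_params)

    def next_batch(self):
        return self.batches.pop(0) if self.batches else None

    def store_parameters(self, params):
        self.parameter_server = dict(params)


class ComputingServer:
    """Stateless worker: trains on one batch, returns parameters, keeps nothing."""

    def train(self, batch, params):
        # Placeholder "training": nudge each parameter toward the batch mean.
        mean = sum(batch.samples) / len(batch.samples)
        return {k: v + 0.01 * (mean - v) for k, v in params.items()}


class TaskInfoServer:
    """Dispatches training work and records which batches have finished."""

    def __init__(self):
        self.completed_batches = set()  # the recorded training progress

    def run_step(self, data_server, computing_server):
        batch = data_server.next_batch()             # data server supplies the batch
        if batch is None:
            return False
        trained = computing_server.train(batch, data_server.parameter_server)
        data_server.store_parameters(trained)        # parameters stored on the data server
        self.completed_batches.add(batch.batch_id)   # progress recorded centrally
        return True


if __name__ == "__main__":
    ds = DataServer([Batch(i, [float(i)] * 4) for i in range(3)], {"w": 0.0})
    tis, worker = TaskInfoServer(), ComputingServer()
    while tis.run_step(ds, worker):
        pass
    print(tis.completed_batches, ds.parameter_server)
```

Keeping the computing servers stateless is what allows their number to vary freely, as the preamble requires: a worker can join or leave without any parameter migration.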
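For the pre-training download step ("downloading ... from a distributed file system"), a hedged sketch assuming an HDFS-style file system accessed through pyarrow; the function name, namenode host, port, and directory paths are all illustrative, since the claim does not name a particular file system or client library.

```python
from pyarrow import fs


def download_before_training(namenode_host: str, remote_dir: str, local_dir: str) -> None:
    """Copy training data and the model definition from HDFS to local disk."""
    hdfs = fs.HadoopFileSystem(namenode_host, port=8020)  # assumed HDFS endpoint
    fs.copy_files(remote_dir, local_dir,
                  source_filesystem=hdfs,
                  destination_filesystem=fs.LocalFileSystem())
```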
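For the survival-detection clauses, a minimal sketch of the coordinator logic, again with hypothetical names (FaultTolerantCoordinator, is_alive, resend_training_requests): heartbeat every computing server while training is suspended, checkpoint the parameter servers when membership is unchanged, and otherwise update the worker list and roll back to the parameters saved at the previous detection before re-dispatching training requests.

```python
import copy


class DataServer:
    """Minimal stand-in: a parameter server plus its last checkpoint."""

    def __init__(self):
        self.parameter_server = {"w": 0.0}
        self.checkpoint = {"w": 0.0}


class Worker:
    def __init__(self, alive=True):
        self.alive = alive


class FaultTolerantCoordinator:
    """Task information server role: heartbeats, checkpointing, re-dispatch."""

    def __init__(self, data_servers, computing_servers):
        self.data_servers = data_servers
        self.available_workers = list(computing_servers)

    def is_alive(self, worker):
        # Stand-in for a real heartbeat / RPC health check.
        return worker.alive

    def survival_detection(self):
        # Training is suspended for the duration of this method.
        survivors = [w for w in self.available_workers if self.is_alive(w)]

        if len(survivors) == len(self.available_workers):
            # Membership unchanged: save the latest model parameters.
            for ds in self.data_servers:
                ds.checkpoint = copy.deepcopy(ds.parameter_server)
        else:
            # Membership changed: update the worker list and reload the
            # parameters saved at the previous survival detection.
            self.available_workers = survivors
            for ds in self.data_servers:
                ds.parameter_server = copy.deepcopy(ds.checkpoint)

        self.resend_training_requests()

    def resend_training_requests(self):
        # New requests would carry the current parameters and the ids of
        # batches already recorded as trained (left abstract here).
        pass


if __name__ == "__main__":
    coord = FaultTolerantCoordinator([DataServer()], [Worker(), Worker(alive=False)])
    coord.survival_detection()
    print(len(coord.available_workers))  # 1: the dead worker was dropped
```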