US 12,314,753 B2
Preflight checks for hardware accelerators in a distributed system
Jiafan Zhu, San Jose, CA (US); Jianqiao Liu, Basking Ridge, NJ (US); Xiangyu Dong, Sunnyvale, CA (US); Xiao Zhang, San Jose, CA (US); Jikai Tang, Santa Clara, CA (US); Kexin Yang, Sunnyvale, CA (US); Yong Zhao, Sunnyvale, CA (US); Alireza Ghaffarkhah, San Jose, CA (US); Arash Rezaei, Saratoga, CA (US); Dayou Du, Jersey City, NJ (US); Yazhou Zu, San Francisco, CA (US); Xiangling Kong, Sunnyvale, CA (US); Hoang-Vu Dang, San Jose, CA (US); and Alexander Vadimovich Kolbasov, Palo Alto, CA (US)
Assigned to Google LLC, Mountain View, CA (US)
Filed by Google LLC, Mountain View, CA (US)
Filed on May 17, 2024, as Appl. No. 18/667,501.
Application 18/667,501 is a continuation of application No. 17/540,123, filed on Dec. 1, 2021, now U.S. Pat. No. 12,020,063.
Prior Publication US 2024/0385873 A1, Nov. 21, 2024
This patent is subject to a terminal disclaimer.
Int. Cl. G06F 9/48 (2006.01); G06F 9/50 (2006.01); G06F 11/30 (2006.01); G06F 11/34 (2006.01)
CPC G06F 9/4843 (2013.01) [G06F 9/5027 (2013.01); G06F 11/3024 (2013.01); G06F 11/3433 (2013.01)]
20 Claims
 
1. A method comprising:
performing a preflight check on a set of hardware accelerator machines to verify a functionality of the set of hardware accelerator machines before performing a computing workload assigned to the set of hardware accelerator machines, comprising:
for each hardware accelerator machine of the set of hardware accelerator machines,
transmitting data including a program code package to the hardware accelerator machine, the program code package comprising a task action representing a sequence of operations to be performed by a node manager in the hardware accelerator machine, the task action being based at least in part on characteristics of the computing workload;
executing, by the node manager, at least a portion of the sequence of operations represented by the task action in the program code package to generate a preflight output that indicates whether the task action fails; and
in response to determining that the preflight output indicates that the task action fails on the hardware accelerator machine, re-assigning the computing workload to a different hardware accelerator machine.
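
The flow recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (Machine, Workload, TaskAction, ProgramCodePackage, PreflightOutput, NodeManager, build_package, preflight_and_assign) is hypothetical, and "transmitting" the package is modeled as a plain function call. The sketch also assumes that a replacement machine is itself preflight-checked before it receives the workload, which the claim does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Machine:
    """Hypothetical stand-in for a hardware accelerator machine."""
    machine_id: str
    accelerators_ok: int  # number of accelerators that currently respond


@dataclass
class Workload:
    """Hypothetical computing workload with one relevant characteristic."""
    name: str
    accelerators_needed: int


# An operation inspects a machine and returns True on success.
Operation = Callable[[Machine], bool]


@dataclass
class TaskAction:
    name: str
    operations: Sequence[Operation]


@dataclass
class ProgramCodePackage:
    task_action: TaskAction


@dataclass
class PreflightOutput:
    machine_id: str
    failed: bool
    failed_step: Optional[int] = None


def build_package(workload: Workload) -> ProgramCodePackage:
    """Derive the task action from the workload's characteristics."""
    def enough_accelerators(machine: Machine) -> bool:
        return machine.accelerators_ok >= workload.accelerators_needed

    action = TaskAction("accelerator_count_check", [enough_accelerators])
    return ProgramCodePackage(action)


class NodeManager:
    """Hypothetical per-machine agent that executes the task action."""

    def __init__(self, machine: Machine) -> None:
        self.machine = machine

    def execute(self, package: ProgramCodePackage) -> PreflightOutput:
        for step, operation in enumerate(package.task_action.operations):
            try:
                ok = operation(self.machine)
            except Exception:
                ok = False  # an operation that raises counts as a failure
            if not ok:
                return PreflightOutput(self.machine.machine_id, True, step)
        return PreflightOutput(self.machine.machine_id, False)


def preflight_and_assign(workload: Workload,
                         assigned: Sequence[Machine],
                         spares: Sequence[Machine]) -> List[str]:
    """Return the machine ids that end up holding the workload."""
    package = build_package(workload)
    spare_pool = list(spares)
    final_ids: List[str] = []
    for machine in assigned:
        output = NodeManager(machine).execute(package)
        # Assumption: each replacement is also checked before use.
        while output.failed and spare_pool:
            machine = spare_pool.pop(0)
            output = NodeManager(machine).execute(package)
        if output.failed:
            raise RuntimeError("no machine passed the preflight check")
        final_ids.append(machine.machine_id)
    return final_ids
```

In this sketch, a workload that needs 8 accelerators, assigned to machines "a" (8 responding) and "b" (4 responding) with a spare "c" (8 responding), ends up on "a" and "c", because "b" fails the preflight check and its share is re-assigned.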