CPC G06F 1/20 (2013.01) [G05B 19/416 (2013.01); G05B 2219/49216 (2013.01)] | 20 Claims |
1. A data processing system for providing computer-implemented services using, at least in part, a graphics processing unit complex comprising graphics processors and pumps used to thermally manage the graphics processors, comprising:
a processor; and
a management controller that is programmed to:
make a first determination regarding whether actual operating conditions for the graphics processing unit complex are within a range of known good configuration data, the actual operating conditions being based on operation of the pumps;
in an instance of the first determination where the actual operating conditions are not within the range:
obtain a failure pattern based on the actual operating conditions;
make a second determination regarding whether the failure pattern matches a known failure pattern;
in an instance of the second determination where the failure pattern matches the known failure pattern:
obtain a type of failure for the known failure pattern;
make a third determination regarding whether the type of failure matches a known failure type;
in an instance of the third determination where the type of failure matches the known failure type:
obtain a failure response corresponding to the known failure type; and
perform a remediation for the graphics processing unit complex using the failure response.
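The nested determinations recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation. Several details are assumptions for illustration only. The metric names (pump_rpm, coolant_c) are hypothetical. Each known good range is modeled as a (low, high) tuple. A "failure pattern" is taken to be the set of out-of-range metrics together with their direction. Matching is done by exact dictionary lookup. Each failure response is a callable that returns a description string in place of real remediation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Optional, Tuple

# Hypothetical representations; the claim does not fix any of these formats.
Range = Tuple[float, float]                 # (low, high) known good bounds
Pattern = FrozenSet[Tuple[str, str]]        # e.g. {("pump_rpm", "low")}


@dataclass
class ManagementController:
    known_good: Dict[str, Range]                    # known good configuration data
    known_patterns: Dict[Pattern, str]              # failure pattern -> type of failure
    known_responses: Dict[str, Callable[[], str]]   # known failure type -> failure response

    def failure_pattern(self, actual: Dict[str, float]) -> Pattern:
        # Record which pump-related metrics left their known good range, and in which direction.
        out = set()
        for name, (lo, hi) in self.known_good.items():
            value = actual[name]
            if value < lo:
                out.add((name, "low"))
            elif value > hi:
                out.add((name, "high"))
        return frozenset(out)

    def check(self, actual: Dict[str, float]) -> Optional[str]:
        pattern = self.failure_pattern(actual)
        # First determination: every condition is within range, so nothing to do.
        if not pattern:
            return None
        # Second determination: does the pattern match a known failure pattern?
        failure_type = self.known_patterns.get(pattern)
        if failure_type is None:
            return None
        # Third determination: does the type of failure match a known failure type?
        response = self.known_responses.get(failure_type)
        if response is None:
            return None
        # Perform the remediation using the failure response (here, a stand-in string).
        return response()


# Hypothetical configuration for illustration only.
controller = ManagementController(
    known_good={"pump_rpm": (2000.0, 4000.0), "coolant_c": (20.0, 45.0)},
    known_patterns={
        frozenset({("pump_rpm", "low"), ("coolant_c", "high")}): "pump_degraded",
    },
    known_responses={"pump_degraded": lambda: "increase redundant pump speed"},
)

print(controller.check({"pump_rpm": 1500.0, "coolant_c": 50.0}))
# prints: increase redundant pump speed
```

Using a frozenset for the pattern makes matching independent of the order in which out-of-range metrics are detected. Keeping pattern-to-type and type-to-response as separate tables mirrors the claim's distinct second and third determinations. A pattern can therefore map to a failure type that has no response on record.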