US 11,928,009 B2
Predicting a root cause of an alert using a recurrent neural network
Chun Qi Ji, Beijing (CN); Qiang Li, Beijing (CN); Jing Sun, Beijing (CN); Zong Nan Jin, Beijing (CN); and He Jun, Beijing (CN)
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Aug. 6, 2021, as Appl. No. 17/395,730.
Prior Publication US 2023/0045303 A1, Feb. 9, 2023
Int. Cl. G06F 11/00 (2006.01); G06F 11/07 (2006.01); G06F 18/231 (2023.01); G06N 3/08 (2023.01)
CPC G06F 11/079 (2013.01) [G06F 18/231 (2023.01); G06N 3/08 (2013.01)]
17 Claims
 
1. A computer-implemented method comprising:
detecting, by a processor, an error alert from a target computer system;
retrieving, by the processor, performance data from the target computer system in response to detecting the error alert;
generating, by the processor and via a gated recurrent unit (GRU) neural network, a prediction of a root cause of the error alert based on the performance data;
receiving, by the processor, feedback of the prediction; and
adjusting, by the processor, weights of a reset gate of the GRU neural network based on the feedback,
wherein the GRU neural network is trained by:
detecting a first computer system in an abnormal state and a second computer system in an abnormal state, wherein a determination that the first computer system and the second computer system are in abnormal states is based on system parameters of the first computer system and the second computer system exhibiting time-based patterns that deviate from expected parameter patterns;
extracting unlabeled data from the first computer system in the abnormal state, wherein the unlabeled data is extracted from the first computer system based on a sampling frequency that varies continuously during a period of time that the first computer system has been in an abnormal state, and wherein the sampling frequency is inversely proportional to the period of time that the first computer system has been in an abnormal state;
labeling the extracted data from the first computer system with the root cause;
clustering the labeled data from the first computer system with unlabeled data from the second computer system;
training the GRU neural network initially with the labeled data from the first computer system; and
training the GRU neural network subsequently with the unlabeled data from the second computer system.
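
Two limitations of claim 1 can be sketched in code: the sampling frequency that is inversely proportional to the time spent in the abnormal state, and the adjustment of only the reset-gate weights in response to feedback. The sketch below is a minimal illustration under stated assumptions and does not reproduce the patented implementation. The proportionality constant k, the scalar (one-unit) GRU cell, the squared-error loss, the finite-difference gradient, and every function and parameter name are hypothetical choices made for illustration.

```python
import math


def sampling_frequency(elapsed_s, k=10.0):
    """Samples per second after elapsed_s seconds in the abnormal state (f = k / t)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return k / elapsed_s


def sample_times(start_s, end_s, k=10.0):
    """Timestamps in [start_s, end_s]; the gap after time t is 1 / f(t) = t / k."""
    times, t = [], start_s
    while t <= end_s:
        times.append(t)
        t += 1.0 / sampling_frequency(t, k)
    return times


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h, p):
    """One step of a scalar GRU cell; p maps parameter names to floats."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h + p["b_r"])  # reset gate
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h) + p["b_h"])
    return (1.0 - z) * h + z * h_cand


def predict(xs, p):
    """Run the cell over the sequence xs, starting from a zero hidden state."""
    h = 0.0
    for x in xs:
        h = gru_step(x, h, p)
    return h


def loss(xs, target, p):
    """Squared error between the final hidden state and the feedback target."""
    return (predict(xs, p) - target) ** 2


def adjust_reset_gate(xs, target, p, lr=0.1, eps=1e-6):
    """One gradient step that changes only the reset-gate parameters.

    Gradients are central finite differences (an assumption made for brevity).
    """
    new_p = dict(p)
    for name in ("w_r", "u_r", "b_r"):
        up = dict(p, **{name: p[name] + eps})
        down = dict(p, **{name: p[name] - eps})
        grad = (loss(xs, target, up) - loss(xs, target, down)) / (2.0 * eps)
        new_p[name] = p[name] - lr * grad
    return new_p


names = ("w_z", "u_z", "b_z", "w_r", "u_r", "b_r", "w_h", "u_h", "b_h")
params = {name: 0.5 for name in names}
series = [1.0, 0.5, -0.3]  # hypothetical performance-data sequence
updated = adjust_reset_gate(series, 1.0, params)  # 1.0 stands in for the feedback
```

With f = k / t, each gap between samples equals t / k, so samples are dense immediately after the abnormal state begins and thin out as it persists. In adjust_reset_gate, the update-gate and candidate-state parameters are copied unchanged, and only w_r, u_r, and b_r move in response to the feedback.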