US 12,277,049 B2
Fault localization in a distributed computing system
Seema Nagar, Bangalore (IN); Pooja Aggarwal, Bengaluru (IN); Qing Wang, Sunnyvale, CA (US); and Larisa Shwartz, Greenwich, CT (US)
Assigned to International Business Machines Corporation, Armonk, NY (US)
Filed by International Business Machines Corporation, Armonk, NY (US)
Filed on Mar. 21, 2022, as Appl. No. 17/655,568.
Prior Publication US 2023/0297490 A1, Sep. 21, 2023
Int. Cl. G06F 11/3604 (2025.01); G06F 11/362 (2025.01)
CPC G06F 11/3612 (2013.01) [G06F 11/366 (2013.01)]
23 Claims
 
1. A computer-implemented method for localizing faults, the method comprising:
monitoring, during runtime execution of an application, for an occurrence of a request failure by tracking requests of the application for a failure, the application communicating with a plurality of resources within a distributed computing system;
identifying a timeframe in which a request failure is observed while tracking requests of the application;
building a causal graph using erroneous logs generated during the timeframe when the request failure is observed;
identifying real-time execution sequences during the timeframe of the request failure based on paths from a gateway node to a set of leaf nodes according to the causal graph;
establishing a set of frequent execution sub-sequences arising during normal operation of the application including communications with the plurality of resources, the establishing based on a log template time series dataset from normal execution logs; and
identifying a missing resource of the plurality of resources by analyzing a candidate execution sequence occurring during the timeframe with respect to a corresponding partially matching frequent execution sub-sequence of the set of frequent execution sub-sequences, the missing resource being present in the corresponding partially matching frequent execution sub-sequence and absent from the candidate execution sequence.
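
The last two identifying steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the causal graph is already available as an adjacency mapping and that the frequent execution sub-sequences have already been mined from normal execution logs. The function names, the example resource names, and the rule used here for "partially matching" (the frequent sub-sequence shares some but not all of its resources with the candidate, and the largest overlap wins) are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence


def gateway_to_leaf_paths(graph: Dict[str, List[str]], gateway: str) -> List[List[str]]:
    """Enumerate paths from the gateway node to leaf nodes of the causal graph."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[gateway]]
    while stack:
        path = stack.pop()
        # Skip nodes already on the path so a cyclic graph cannot loop forever.
        next_nodes = [n for n in graph.get(path[-1], []) if n not in path]
        if not next_nodes:
            paths.append(path)  # reached a leaf: one candidate execution sequence
        for node in next_nodes:
            stack.append(path + [node])
    return paths


def missing_resources(
    candidate: Sequence[str],
    frequent_subsequences: Sequence[Sequence[str]],
) -> List[str]:
    """Return resources present in the best partially matching frequent
    sub-sequence but absent from the candidate execution sequence.

    Assumption: "partially matching" means sharing at least one, but not
    all, resources with the candidate.
    """
    candidate_set = set(candidate)
    best: Sequence[str] = ()
    best_overlap = 0
    for freq in frequent_subsequences:
        freq_set = set(freq)
        overlap = len(freq_set & candidate_set)
        if 0 < overlap < len(freq_set) and overlap > best_overlap:
            best, best_overlap = freq, overlap
    return [resource for resource in best if resource not in candidate_set]


# Hypothetical causal graph built from erroneous logs in the failure timeframe.
graph = {"gateway": ["auth", "cart"], "auth": ["db"], "cart": []}
# Hypothetical frequent sub-sequences observed during normal operation.
frequent = [("gateway", "cart", "inventory"), ("gateway", "auth", "db")]

for path in gateway_to_leaf_paths(graph, "gateway"):
    print(path, "missing:", missing_resources(path, frequent))
```

In this example the candidate sequence gateway, cart partially matches the frequent sub-sequence gateway, cart, inventory, so inventory is reported as the missing resource. A real system would likely need a stricter matching rule, for instance one that respects ordering or requires a minimum overlap.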