US 12,463,876 B2
Utilizing monitoring service exploration to improve service incident mitigation and localization
Myriam Titon, Jerusalem (IL); Adir Hudayfi, Eilat (IL); and Zakie Mashiah, Qiryat Ono (IL)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Feb. 14, 2023, as Appl. No. 18/169,009.
Prior Publication US 2024/0275699 A1, Aug. 15, 2024
Int. Cl. H04L 41/5074 (2022.01); G06F 16/35 (2019.01); H04L 41/069 (2022.01)
CPC H04L 41/5074 (2013.01) [G06F 16/35 (2019.01); H04L 41/069 (2013.01)]
17 Claims
 
1. A computer-implemented method for mitigating an outage on a cloud computing system, comprising:
identifying an outage ticket for an outage that corresponds to a user-impacting incident within the cloud computing system;
grouping a set of monitoring incident tickets by monitoring service identifiers, the set of monitoring incident tickets occurring within a target time window of the outage ticket;
determining an initial group service incident score for a plurality of monitoring incident tickets within a given monitoring service identifier group associated with a given monitoring service identifier, the initial group service incident score being based on elapsed times between each of the plurality of monitoring incident tickets and corresponding consecutive monitoring incident tickets of the given monitoring service identifier, wherein each monitoring incident ticket of the plurality of monitoring incident tickets shares the initial group service incident score;
for each monitoring incident ticket of the plurality of monitoring incident tickets, generating a service incident score for the given monitoring incident ticket by weighting the initial group service incident score of the given monitoring incident ticket based on a set of monitoring service factors;
generating a ranked list of relevant monitoring incident tickets based on the service incident scores of the plurality of monitoring incident tickets;
generating tokenized text from the outage ticket and from monitoring incident tickets from the ranked list of relevant monitoring incident tickets;
correlating the tokenized text from the outage ticket and from the monitoring incident tickets from the ranked list of relevant monitoring incident tickets with text or metadata of a set of relevant mitigation teams to generate a list of corresponding relevant mitigation teams for the outage ticket;
ranking the corresponding relevant mitigation teams in the list of corresponding relevant mitigation teams based on correlation strengths; and
providing the outage ticket and one or more monitoring incident tickets on the ranked list of relevant monitoring incident tickets to a top-ranked mitigation team from the list of corresponding relevant mitigation teams, wherein the top-ranked mitigation team uses information from the outage ticket and the one or more monitoring incident tickets on the ranked list of relevant monitoring incident tickets to identify and resolve the outage.
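
For illustration only, the steps recited in claim 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation. The claim does not fix a scoring formula, the monitoring service factors, or a correlation measure, so the following are all hypothetical choices made for the example: the `Ticket` fields, the inverse-mean-gap group score, the use of `severity` as the only weighting factor, and Jaccard overlap of tokens as the correlation strength.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    service_id: str    # monitoring service identifier
    created_at: float  # seconds since epoch
    text: str
    severity: int = 3  # hypothetical factor: 1 (most severe) to 4


def group_by_service(tickets, outage_time, window_s):
    """Group tickets created within +/- window_s of the outage by service id."""
    groups = defaultdict(list)
    for t in tickets:
        if abs(t.created_at - outage_time) <= window_s:
            groups[t.service_id].append(t)
    return groups


def initial_group_score(group):
    """Assumed formula: shorter gaps between consecutive tickets score higher."""
    times = sorted(t.created_at for t in group)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        return 0.0  # a lone ticket has no consecutive neighbor
    mean_gap_min = sum(gaps) / len(gaps) / 60.0
    return 1.0 / (1.0 + mean_gap_min)


def rank_tickets(groups):
    """Weight each group's shared score per ticket, then sort descending."""
    scored = []
    for group in groups.values():
        base = initial_group_score(group)  # shared by every ticket in the group
        for t in group:
            weight = 1.0 / t.severity      # hypothetical monitoring service factor
            scored.append((base * weight, t))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_teams(outage_text, ranked, teams, top_n=5):
    """Rank teams by Jaccard overlap (an assumed correlation strength)."""
    tokens = tokenize(outage_text)
    for _, t in ranked[:top_n]:
        tokens |= tokenize(t.text)
    strengths = []
    for name, description in teams.items():
        team_tokens = tokenize(description)
        union = tokens | team_tokens
        strength = len(tokens & team_tokens) / len(union) if union else 0.0
        strengths.append((strength, name))
    strengths.sort(key=lambda pair: pair[0], reverse=True)
    return strengths
```

In this sketch, the first entry returned by `rank_teams` plays the role of the top-ranked mitigation team, and the outage ticket would be routed to it together with the leading entries of `rank_tickets`. A production system would likely replace the token-overlap measure with a trained text model and derive the weights from historical incident data.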