US 12,216,552 B2
Multi-phase cloud service node error prediction based on minimization function with cost ratio and false positive detection
Qingwei Lin, Beijing (CN); Kaixin Sui, Beijing (CN); and Yong Xu, Beijing (CN)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Appl. No. 17/056,744
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
PCT Filed Jun. 29, 2018, PCT No. PCT/CN2018/093775 § 371(c)(1), (2) Date Nov. 18, 2020, PCT Pub. No. WO2020/000405, PCT Pub. Date Jan. 2, 2020.
Prior Publication US 2021/0208983 A1, Jul. 8, 2021
Int. Cl. G06F 11/14 (2006.01); G06F 9/455 (2018.01); G06F 9/48 (2006.01); G06F 9/50 (2006.01); G06N 5/01 (2023.01)

CPC G06F 11/1484 (2013.01) [G06F 9/45558 (2013.01); G06F 9/4856 (2013.01); G06F 9/5072 (2013.01); G06F 11/142 (2013.01); G06N 5/01 (2023.01); G06F 2009/45562 (2013.01); G06F 2009/4557 (2013.01); G06F 2009/45591 (2013.01)]

20 Claims

1. A system for predicting computing node failure in a cloud computing platform, the system comprising:

at least one processor; and

memory including instructions that, when executed by the at least one processor, cause the at least one processor to perform operations to:

obtain a set of spatial metrics and a set of temporal metrics for computing node devices in the cloud computing platform, the set of spatial metrics comprising spatial signals from hardware and software components shared by the computing node devices, and the set of temporal metrics comprising temporal signals from hardware and software components for each computing node device of the computing node devices;

evaluate the computing node devices using a spatial machine learning model and the set of spatial metrics and using a temporal machine learning model and the set of temporal metrics to create a spatial output and a temporal output for each computing node device of the computing node devices;

determine one or more potentially faulty computing node devices based on an evaluation of the spatial output and the temporal output using a ranking model, wherein the one or more potentially faulty computing node devices is a subset of the computing node devices;

identify one or more migration source computing node devices from the one or more potentially faulty computing node devices, wherein a number of computing node devices included in the one or more migration source computing node devices are determined using a threshold calculated by applying a minimization function to a cost ratio and a predicted number of false positive detections included in the one or more potentially faulty computing node devices and a predicted number of false negative detections excluded from the one or more potentially faulty computing node devices, the cost ratio representing a ratio between a historical cost of false positive detection and a historical cost of false negative detection;

identify one or more migration target computing node devices from one or more potentially healthy computing node devices; and

migrate a virtual machine (VM) from a faulty computing node device of the one or more migration source computing node devices to a healthy computing node device of the one or more migration target computing node devices.
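
The threshold step recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the ranking model yields a per-node failure probability, that the predicted false positive and false negative counts are estimated as expected values from those probabilities, and that the minimized quantity has the form cost_ratio * FP + FN. The claim does not specify any of these details; the function name `choose_migration_sources` and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def choose_migration_sources(
    scores: Dict[str, float], cost_ratio: float
) -> Tuple[List[str], float]:
    """Pick the top-r ranked nodes that minimize cost_ratio * FP + FN.

    scores: node id -> estimated failure probability (assumed ranking output).
    cost_ratio: cost of a false positive divided by cost of a false negative.
    """
    # Rank nodes from most to least likely to fail.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    probs = [p for _, p in ranked]
    expected_faulty = sum(probs)

    # r = 0: nothing is flagged, so FP = 0 and every expected fault is missed.
    best_r, best_cost = 0, expected_faulty
    fp = 0.0  # expected healthy nodes inside the top r (false positives)
    tp = 0.0  # expected faulty nodes inside the top r
    for r, p in enumerate(probs, start=1):
        fp += 1.0 - p
        tp += p
        fn = expected_faulty - tp  # expected faulty nodes left out of the top r
        cost = cost_ratio * fp + fn
        if cost < best_cost:
            best_r, best_cost = r, cost

    return [node for node, _ in ranked[:best_r]], best_cost


if __name__ == "__main__":
    demo = {"node-a": 0.9, "node-b": 0.8, "node-c": 0.3, "node-d": 0.1}
    sources, cost = choose_migration_sources(demo, cost_ratio=1.0)
    print(sources, round(cost, 3))
```

Under these assumptions, a higher cost ratio (false positives relatively more expensive) shrinks the set of migration sources, while a lower ratio enlarges it.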