Node failure prediction in large-scale cloud computing service system


System Software and Operation




In recent years, more and more users have migrated their business software systems to cloud computing platforms such as the Elastic Computing Service (ECS) of Alibaba Cloud. ECS provides basic computing units, such as virtual machines (VMs), on which users deploy their own business applications. Service quality matters because system failures can seriously affect business and user experience. ECS typically contains a large number of physical computing servers, or "nodes". In practice, nodes may fail, and the VMs running on a failed node may become unavailable, which in turn may affect the availability of customer applications. In our experience, node failure is one of the top causes of service downtime. The ability to predict faulty nodes before a failure actually happens enables the cloud service system to perform proactive live migration, that is, to migrate the virtual machines from the faulty node to a healthy node without disconnecting the service, thereby improving service availability.

We propose this topic to predict the failure-proneness of nodes in our cloud computing service system. The work applies machine learning techniques to learn the characteristics of historical failure data, build a failure prediction model, and evaluate the model's performance on real-world data.

There are several technical challenges in designing a failure prediction model for a large-scale cloud computing service system.

Due to the high complexity of the cloud service system, node failures can be caused by many different software or hardware issues, such as memory faults, CPU faults, motherboard faults, disk failures, kernel crashes, VMM failures, application bugs, and overheating. Simple rule-based or threshold-based models cannot locate the problem or achieve good prediction results.

Complex failure-indicating signals. The failure of a single node can be indicated by many signals coming from a variety of software or hardware sources on the node. Examples of these temporal signals are kernel events, hardware sensor data, VMM logs, application logs, performance counters, and server resource monitoring data (such as CPU utilization and disk utilization). They are continuously collected from a variety of data sources with different data structures.
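One common way to bring such heterogeneous signals into a single representation is to bucket them into fixed time windows and compute simple per-window features. The sketch below is only an illustration under stated assumptions: the window length, the tuple layouts, the function names, and the keyword list are all hypothetical and do not describe the actual ECS data pipeline.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 300  # hypothetical 5-minute aggregation window


def window_index(timestamp, window=WINDOW_SECONDS):
    """Map a Unix timestamp (in seconds) to the index of its time window."""
    return int(timestamp // window)


def build_features(metric_samples, log_events, keywords):
    """Merge numeric samples and log events into per-window feature rows.

    metric_samples: list of (timestamp, metric_name, value) tuples
    log_events:     list of (timestamp, message) tuples
    keywords:       lowercase substrings to count in log messages
    Returns a dict mapping window index -> dict of feature name -> value.
    """
    values = defaultdict(lambda: defaultdict(list))
    counts = defaultdict(lambda: defaultdict(int))

    # Structured time series: collect raw values per window and metric.
    for ts, name, value in metric_samples:
        values[window_index(ts)][name].append(value)

    # Unstructured logs: count keyword occurrences per window.
    for ts, message in log_events:
        lowered = message.lower()
        for kw in keywords:
            if kw in lowered:
                counts[window_index(ts)][kw] += 1

    rows = {}
    for w in sorted(set(values) | set(counts)):
        row = {}
        for name, vals in values[w].items():
            row[f"{name}_mean"] = mean(vals)
            row[f"{name}_max"] = max(vals)
        for kw in keywords:
            row[f"log_{kw}_count"] = counts[w][kw]
        rows[w] = row
    return rows
```

Each resulting row can then serve as one time step of a multi-dimensional series for a downstream model. Real log data would likely need template extraction rather than keyword matching; keyword counts are used here only to keep the sketch short.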

Highly imbalanced data. Node fault data is highly imbalanced: the ratio of failed to healthy nodes is less than 1:2000. This imbalance poses great challenges to prediction.
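To see why this imbalance is a problem, consider how plain accuracy behaves. The sketch below uses synthetic labels at exactly 1:2000 (an illustrative value, not real ECS data) and hypothetical helper functions. It shows that a model which never predicts a failure still scores very high accuracy while catching no failures, which is why precision and recall on the failure class are the more informative measures.

```python
def accuracy(y_true, y_pred):
    """Fraction of nodes whose label was predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the failure class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Synthetic example: 1 failed node (label 1) and 2000 healthy nodes (label 0).
y_true = [1] + [0] * 2000

# A trivial "model" that declares every node healthy.
always_healthy = [0] * len(y_true)

print(accuracy(y_true, always_healthy))             # above 0.999
print(precision_recall_f1(y_true, always_healthy))  # no failure is caught
```

Common responses to this kind of imbalance include resampling, class weighting, and ranking nodes by predicted risk instead of applying a hard threshold; which of these suits this data set is an open question for the project.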

Improving the interpretability of the prediction algorithms will help the cloud service system take maintenance actions on nodes.

To tackle the challenges of node failure prediction in the Elastic Computing Service of Alibaba Cloud, we have built a node failure data set with accurate labels and rich failure-indicating signal data based on Elastic Computing Service. The data set labels the root cause of each node failure and includes full-stack (software and hardware) monitoring data and log data collected before the node failure.


  • A methodical model for node failure prediction in our large-scale cloud computing service system.
  • A general node failure prediction technique for time series data of multi-dimensional metrics and multi-source data in our large-scale cloud computing service system.

Related Research Topics

  • High interpretability of node failure prediction algorithms in a large-scale cloud computing service system.
  • Performance evaluation model for node failure prediction in our production environment.
  • Feature engineering for multi-source data, including feature extraction, fusion, screening, and multi-source feature correlation analysis. The data comes from different sources (e.g., sensors, logs, and monitoring data) with different structures, such as sequential time series data or unstructured log data.
  • Prediction algorithms for node failures caused by different subsystems (kernel, virtualization software, memory, CPU, motherboard, etc.) in a large-scale cloud computing service system.
  • Hardware fault prediction algorithms (for components such as memory, CPU, disk, and motherboard) in a large-scale cloud computing service system.


Suggested Collaboration Method

AIR (Alibaba Innovative Research), one-year collaboration project. 
