Research on AIOps for cloud platforms based on a service self-operation and maintenance system

Research Themes



Alibaba Cloud's basic platform services are deployed in clusters that range from a dozen servers to tens of thousands, which requires large-scale, automated deployment and operations capabilities. Tianji currently operates and maintains millions of servers in the public cloud and has deployed hundreds of environments in the private cloud, despite limited operations manpower and expertise. Operations scenarios fall into two groups. In proactive operations, changes touch a wide range of components, so they must be both safe and efficient, and gray release and status feedback need continuous improvement. In passive operations, deployed environments and servers differ significantly, and software, disk, and machine failures during daily operation are unavoidable. These failures must be detected quickly and remediated promptly. At a scale of millions of servers, even a daily failure rate of one in ten thousand means on the order of hundreds of failures per day, more than staff can handle manually in time.

We hope to address these cloud platform operations problems by optimizing hardware-software linkage protocols and applying intelligent algorithms. We aim to use algorithms such as trend forecasting, anomaly detection, and correlation analysis to analyze monitoring and operations data and to form decision plans that enable automatic or semi-automatic operations, ultimately meeting the cloud platform's safe-production goals efficiently and at low cost. Specifically, we hope to advance the following business directions:

1.     Integrated regulatory control: The operations platform represents the running state of software, servers, and other components as state machines. Monitoring data drives the state machines, and a self-developed decider mechanism links software and servers. We hope to work more closely with the academic community to find theoretical methods that support increasingly complex states, monitoring, and operations actions across multiple levels.

2.     Fault management scenario: Apply intelligence to the four stages of fault prediction, fault discovery, fault diagnosis, and fast fault recovery. Early fault warning prevents impact on business operations; typical problems include disk failure prediction, memory OOM warning, disk capacity prediction, and machine crash prediction. Fast fault discovery directly affects the cloud platform's MTTR. Beyond the constant thresholds configured on the engineering side, this stage involves algorithmic problems such as intelligent baseline prediction for business metrics and detection on application golden indicators, and it needs algorithms to reduce the configuration cost of massive numbers of indicators. Accurate fault diagnosis narrows the scope of recovery, and fault localization algorithms must provide solutions under varying data collection capabilities. Fast fault recovery requires building a fast-recovery knowledge base and stop-loss contingency plans for common faults. Algorithms are also needed to match the current fault to a fast-recovery plan and trigger the operations platform to execute it.

3.     Change management scenario: Typically, 60% to 70% of faults in the industry are caused by changes. We therefore need to build observability for changes, using algorithms to compare instances before and after a change, and changed instances against unchanged ones, to judge whether the current change is behaving normally. If necessary, following the gray-release strategy of the operations platform, the change is slowed down or even paused, and operations personnel are brought in to confirm that the change is working as intended.

4.     Capacity management scenario: The cloud also faces capacity management issues for application software and physical servers. For example, the cloud platform needs to clearly define the capacity watermark for cloud product management and the capacity watermark for cloud instances. When capacity approaches the upper or lower watermark threshold, the platform can promptly notify operations personnel or trigger automated operations to reduce load risk. It also keeps resource use effective by scaling in and by using scheduling to achieve an optimal application layout. In addition, operations administrators need to consider the current capacity of physical servers: given the current trend, how long can the existing machine fleet support business growth, and when do we need to add machines or expand the entire cloud platform?

5.     Cost optimization scenario: With the development of container technology and improved scheduling capabilities, the cloud platform can plan and migrate application instances to other machines without affecting upper-layer business. This lets algorithms reduce machine resource costs by solving the instance redistribution problem with the fewest moves, a typical operations research problem.
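
The capacity question raised in item 4 (how long the current machine scale can support business growth) can be illustrated with a simple trend extrapolation. The following is a minimal sketch, not the platform's method: it assumes daily usage samples, fits an ordinary least-squares line, and estimates how many days remain before the trend crosses a watermark. The function name `days_until_watermark` and the sample numbers are hypothetical.

```python
from statistics import mean


def days_until_watermark(daily_usage, watermark):
    """Estimate days remaining until a linear usage trend reaches the watermark.

    daily_usage: one usage sample per day, oldest first (hypothetical input format).
    watermark:   capacity threshold in the same unit as the samples.
    Returns None when the fitted trend is flat or falling (no crossing predicted).
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two daily samples to fit a trend")

    days = list(range(n))
    day_mean = mean(days)
    usage_mean = mean(daily_usage)

    # Ordinary least-squares slope and intercept of usage over day index.
    covariance = sum((d - day_mean) * (u - usage_mean)
                     for d, u in zip(days, daily_usage))
    variance = sum((d - day_mean) ** 2 for d in days)
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = usage_mean - slope * day_mean

    # Day index at which the fitted line equals the watermark,
    # expressed relative to the most recent sample (index n - 1).
    crossing_day = (watermark - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    # Usage grows by 10 units per day; the watermark is 200 units.
    print(days_until_watermark([100, 110, 120, 130, 140], 200))
```

A linear fit is only a starting point. Seasonal or bursty workloads would need a forecasting model that accounts for those patterns, and the watermark itself would have to come from the capacity assessment described in this document.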

In conjunction with the above business problems, we hope to solve the following technical problems:

1.     Optimization of hardware and software linkage protocols: For complex operations scenarios such as software releases, system configuration changes, hardware maintenance, and switch changes, research and design closed-loop linkage protocols so that multiple operations on the same machine or machine group execute in order, without deadlock or abnormal blocking. This will allow monitoring data from the node, resource, application, data, and other layers to be used more effectively across operations capabilities.

2.     Machine fault prediction: Solve common problems such as disk failure prediction, memory OOM warning, disk capacity prediction, and machine crash prediction, and trigger the appropriate operations action for each scenario, such as disk cleanup, application restart, container migration, or machine restart. This prevents applications from being affected when the underlying machine becomes unavailable, shifting operations from firefighting to fire prevention.

3.     Large-scale indicator inspection: During faults and in daily operations, cloud platform applications and cloud instances need to be inspected. Much of this work consists of reviewing monitoring indicators, and the massive number of indicators makes threshold configuration very costly. Conventional anomaly detection algorithms also struggle to meet real-time computing requirements at this scale. Large-scale indicator inspection is therefore a significant problem to solve.

4.     Capacity watermark assessment: The capacity watermark of cloud platform applications or cloud instances directly affects the stability of the cloud platform, and evaluating an application's watermark is a relatively complex problem. When capacity approaches the watermark threshold, the operations platform should be triggered automatically to expand capacity, reducing capacity risk to the cloud platform.
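
The threshold configuration cost mentioned in item 3 can be illustrated with a detector that needs no per-indicator threshold. The sketch below is an assumption for illustration, not an algorithm named in this document: it uses the modified z-score (based on the median and the median absolute deviation, with the commonly used constant 0.6745 and cutoff 3.5) so that one setting applies to indicators of very different scales. The function name `robust_anomalies` and the sample indicators are hypothetical.

```python
from statistics import median


def robust_anomalies(history, latest, cutoff=3.5):
    """Return names of indicators whose latest value is an outlier.

    history: dict mapping indicator name -> list of recent values.
    latest:  dict mapping indicator name -> newest observed value.
    cutoff:  modified z-score above which a value is flagged (3.5 is a common default).
    """
    flagged = []
    for name, values in history.items():
        center = median(values)
        # Median absolute deviation: a spread estimate that tolerates outliers.
        mad = median([abs(v - center) for v in values])
        deviation = abs(latest[name] - center)

        if mad == 0:
            # Constant history: treat any departure from it as anomalous.
            if deviation > 0:
                flagged.append(name)
            continue

        # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
        score = 0.6745 * deviation / mad
        if score > cutoff:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    history = {
        "cpu_percent": [50, 52, 49, 51, 50, 48, 53],
        "qps": [1000, 1010, 990, 1005, 995, 1002, 998],
    }
    latest = {"cpu_percent": 51, "qps": 1500}
    print(robust_anomalies(history, latest))
```

Because the score is normalized by each indicator's own spread, the same cutoff works for a percentage and a request rate. Meeting real-time requirements across tens of thousands of indicators would additionally require incremental or batched computation, which this sketch does not address.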


Code: Complete the algorithm design and deliver the source code. Papers: Publish 1-2 papers at CCF-A venues or other top-level conferences or journals recognized by Alibaba. Technical indicators:

1.     The software-hardware linkage protocol scheme can be implemented, and operations actions within and between servers execute automatically and in order. It covers all current monitoring methods and converts their output, automatically or semi-automatically, into operations action instructions, avoiding interference between monitoring data at different layers.

2.     Fault warning covers at least 3 types of fault scenario prediction across machines, cloud products, and cloud instances, with an accuracy rate of > X% and a recall rate of > Y%. Fault warning latency is less than Z ms.

3.     Large-scale indicator inspection supports simultaneous fault inspection of tens of thousands of indicators or more. Inspection latency is less than X minutes, and the fault recall rate is > Y%.

4.     Capacity watermark assessment can estimate the watermark of an application, cloud product, or cloud instance, with a difference of < X% between the estimated and actual capacity watermark. Estimation time for a single application is less than Y ms.
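
The accuracy and recall targets in item 2 can be made concrete with a small evaluation helper. The sketch below assumes that "accuracy rate" refers to precision (the fraction of raised warnings that correspond to real faults), which this document does not state explicitly. The function name `precision_recall` and the disk identifiers are hypothetical.

```python
def precision_recall(predicted_faults, actual_faults):
    """Compute precision and recall for a set of fault warnings.

    predicted_faults: identifiers for which a warning was raised.
    actual_faults:    identifiers that really failed.
    Returns (precision, recall); each is 0.0 when its denominator is empty.
    """
    predicted = set(predicted_faults)
    actual = set(actual_faults)
    true_positives = len(predicted & actual)

    # Precision: share of warnings that were real faults.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real faults that were warned about.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    warned = ["disk-1", "disk-2", "disk-3", "disk-4"]
    failed = ["disk-2", "disk-3", "disk-5"]
    precision, recall = precision_recall(warned, failed)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers matters because they trade off against each other: warning on every machine gives perfect recall with very low precision, which would overwhelm the operations platform with unnecessary actions.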

Related Research Topics

  1. Anomaly Detection of Temporal Indicators
  2. Automated Fault Diagnosis System for Microservices
  3. Research on Fault Self-healing of Large-scale Distributed Systems
