Research on Next-generation Virtualization Technologies
Large-scale data centers are the key infrastructure behind today's enterprise-level Internet applications and cloud computing systems. However, server resource utilization in data centers remains persistently low (typically only 10% to 20%), which wastes a large amount of computing resources and has become a key obstacle to improving enterprises' computing efficiency.
To improve resource utilization and reduce machine costs, co-location technology is widely adopted in cloud data centers: latency-critical online services and best-effort offline tasks are co-located on the same set of computing resources. The resource demands of online services and offline tasks are usually complementary. For example, online services are usually over-provisioned to guarantee their service level agreements (SLAs), so reclaiming idle resources from online services and offering them to offline tasks can improve resource utilization in cloud data centers.
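As a minimal sketch of the reclaiming idea above (the function name, parameters, and safety margin are illustrative assumptions, not a production policy), the capacity that can be lent to offline tasks is roughly the gap between what an online service reserves and its predicted peak usage, minus headroom for SLA protection:

```python
def reclaimable_cpu(reserved_cores: float, predicted_peak_usage: float,
                    safety_margin: float = 0.1) -> float:
    """Estimate CPU cores that can be offered to offline tasks.

    reserved_cores: cores provisioned for the online service (its SLA budget).
    predicted_peak_usage: forecast peak usage of the online service.
    safety_margin: fraction of the reservation held back to absorb spikes.
    """
    headroom = reserved_cores * safety_margin
    return max(0.0, reserved_cores - predicted_peak_usage - headroom)

# An over-provisioned service reserving 16 cores but peaking at 6:
print(reclaimable_cpu(16, 6))  # → 8.4 cores available for offline tasks
```

Note that the estimate is only as good as `predicted_peak_usage`, which is exactly why demand prediction is central to this proposal.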
In current cloud data centers, services and tasks are usually mapped to a set of Linux containers, which are managed by a container orchestrator such as Kubernetes. Compared with workloads running in virtual machines, container-based workloads are more flexible and elastic. Because container scaling is cheap, it is feasible to adopt reactive solutions to cope with fluctuating workloads, such as horizontal autoscaling (adding or removing replicas in response to end-user traffic fluctuations) and vertical autoscaling (tuning the amount of resources available to each replica).
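As a hedged sketch of reactive horizontal autoscaling (a simplification, not Kubernetes' full controller logic), the core rule is ratio based, as documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas scale with the ratio of observed to target utilization:

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Ratio-based scale-out/scale-in rule, clamped to configured bounds."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# Traffic spike: 4 replicas running at 90% CPU against a 50% target.
print(desired_replicas(4, 0.9, 0.5))  # → 8 (scale out)

# Traffic lull: the same rule scales back in.
print(desired_replicas(4, 0.2, 0.5))  # → 2 (scale in)
```

Because the rule only reacts to utilization already observed, the scale-out decision necessarily trails the spike, which motivates the delay concern discussed next.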
However, reactive scaling can lag behind sudden application demands and may incur significant overhead due to improper resource decisions. For example, when an application experiences a sudden spike in resource demand but resides on a cluster without sufficient resources, it must be migrated to another cluster with enough capacity. Such migration delays the handling of the workload spike and incurs significant resource overhead.
Predicting the resource demands of online services is one of the critical issues in cloud-native cost-efficiency technology, as it determines how many resources can be reclaimed for offline tasks without violating the SLAs of online services. Workload resource demand prediction has been widely investigated for years. However, most existing prediction models are regression based and require the workload to exhibit seasonality and/or a trend. A cloud data center runs a wide variety of services and tasks: while some services have predictable workload patterns, many tasks (such as data analysis and mining jobs) exhibit neither seasonality nor a predictable trend. Thus, conventional prediction models cannot accurately predict unexpected workload patterns, such as sudden spikes in resource demand.
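To make this limitation concrete, here is a hedged toy example: a seasonal-naive model (predicting the next value as the value one period ago, used here as a stand-in for regression-based predictors that assume seasonality) tracks a periodic workload perfectly but, by construction, cannot anticipate a sudden spike:

```python
def seasonal_naive(history, period):
    """Predict the next value as the value one period ago.

    A stand-in for seasonality-assuming predictors; it carries no
    information about unprecedented demand spikes.
    """
    return history[-period]

# A workload repeating every 4 steps: the model is exact...
periodic = [10, 20, 30, 20, 10, 20, 30, 20]
print(seasonal_naive(periodic, period=4))  # → 10, the true next value

# ...but a spike that just occurred is invisible to it:
spiky = [10, 20, 30, 20, 10, 20, 90, 20]   # demand jumped to 90 recently
print(seasonal_naive(spiky, period=4))     # → 10, even if demand stays elevated
```

Any model in this family under-predicts during spikes, which is precisely when under-provisioning an online service risks an SLA violation.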
- An efficient resource demand prediction model (covering CPU, memory, disk, network bandwidth, etc.) for diverse workload patterns, which should balance prediction accuracy against computing cost.
- A systematic approach to detect shared resource interference between co-located workloads and generate schedules that avoid problematic co-locations.
- Publication of 1-2 papers at CCF-A venues, or at top conferences and journals in the field recognised by Alibaba.
Related Research Topics
- Container-based workload co-location technology
- Workload resource demand prediction methods in cloud computing
- Time-series-based prediction models
- Interference-aware cluster management technology