Alibaba Innovative Research (AIR) > Research on Next-generation Virtualization Technologies
Refined Resource Portrait of Cloud Native Applications (云原生应用精细化资源特征画像)

Research Themes

Research on Next-generation Virtualization Technologies

Background

Large-scale data centers are the key infrastructure behind today's enterprise Internet applications and cloud computing systems. However, server resource utilization in data centers remains persistently low (only 10% to 20%), which wastes a large amount of computing resources and has become a key obstacle to improving enterprise computing efficiency.


To improve resource utilization and reduce machine costs, co-location technology is widely adopted in cloud data centers: latency-critical online services and best-effort offline tasks are placed on the same set of computing resources. The resource demands of online services and offline tasks are usually complementary. For example, online services are usually over-provisioned to guarantee their service level agreements (SLAs), so reclaiming idle resources from online services and offering them to offline tasks can improve overall resource utilization in cloud data centers.
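The reclamation idea above can be sketched as a simple calculation: the resources reclaimable from an online service are its provisioned amount minus a conservative estimate of its demand. The function name, the 95th-percentile estimator, and the 20% headroom below are illustrative assumptions, not the policy of any specific co-location system.

```python
# Sketch: estimate how many CPU cores an over-provisioned online service
# can lend to offline tasks, keeping safety headroom above its recent peak.
# The p95 estimator and 20% headroom are illustrative assumptions.

def reclaimable_cores(provisioned, usage_samples, headroom=0.2):
    """Cores that can be offered to offline tasks without risking the SLA."""
    samples = sorted(usage_samples)
    # Conservative demand estimate: 95th percentile of observed usage.
    p95 = samples[int(0.95 * (len(samples) - 1))]
    reserved = p95 * (1 + headroom)          # keep headroom above the peak
    return max(0.0, provisioned - reserved)

# A service provisioned 16 cores but mostly using 4-6 of them:
usage = [4.0, 4.5, 5.0, 5.5, 6.0, 4.2, 5.1, 4.8, 5.9, 4.4]
print(reclaimable_cores(16.0, usage))   # roughly 9 cores are reclaimable
```

A real system would refresh the usage window continuously and return the reclaimed cores to the online service as soon as its load rises.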


In current cloud data centers, services and tasks are usually mapped to sets of Linux containers, which are managed by a container orchestrator such as Kubernetes. Compared with workloads running in virtual machines, container-based workloads are more flexible and elastic. Because container scaling is cheap, reactive solutions can be adopted to cope with fluctuating workloads, such as horizontal autoscaling (adding or removing replicas in response to end-user traffic fluctuations) and vertical autoscaling (tuning the amount of resources available to each replica).
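The horizontal case can be made concrete with the core rule used by reactive autoscalers such as the Kubernetes HPA: desired replicas = ceil(current replicas × current metric / target metric). The function name and the replica bounds below are illustrative assumptions.

```python
import math

# Sketch of the reactive horizontal-scaling rule used by autoscalers such
# as the Kubernetes HPA: desired = ceil(current * metric / target).
# Function name and the lo/hi replica bounds are illustrative assumptions.

def desired_replicas(current, metric_value, target_value, lo=1, hi=50):
    """Replicas needed so that per-replica load returns to the target."""
    desired = math.ceil(current * metric_value / target_value)
    return min(hi, max(lo, desired))

# 4 replicas, each targeting 100 QPS, but traffic spikes to 180 QPS each:
print(desired_replicas(4, 180, 100))   # → 8 (scale out)
# traffic falls to 30 QPS per replica:
print(desired_replicas(4, 30, 100))    # → 2 (scale in)
```

Vertical autoscaling follows the same reactive pattern, except that it resizes each replica's CPU/memory allocation instead of changing the replica count.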


However, reactive scaling is typically triggered by thresholds on load metrics such as QPS, so it responds to sudden application demand with a delay and may incur significant overhead due to improper resource decisions. For example, when an application experiences a sudden spike in resource demand but resides on a cluster without sufficient resources, it must be migrated to another cluster that has them. The migration delays the handling of the workload spike and adds significant resource overhead.


Predicting the resource demands of online services is one of the critical issues in cloud-native cost-efficiency technology: it determines how many resources can be reclaimed for offline tasks without violating the SLAs of online services. Workload resource demand prediction has been widely investigated for several years. However, most existing prediction models are regression based and require the workload to exhibit seasonality and/or a trend. A cloud data center runs a wide variety of services and tasks; while some services have predictable workload patterns, many other tasks (such as data analysis and mining jobs) usually exhibit no seasonality or predictable trend. Conventional prediction models therefore cannot accurately predict unexpected workload patterns, such as sudden spikes in resource demand.
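To make this limitation concrete, the sketch below uses a seasonal-naive forecaster (predict the value observed one period earlier) as a simple stand-in for the seasonality-based models described above: it tracks a repeating workload well but, by construction, cannot anticipate an unseen spike. All names and numbers are illustrative.

```python
# Sketch: a seasonal-naive forecaster, standing in for seasonality-based
# regression models. It works when the workload repeats every `period`
# steps, and misses sudden spikes by design.

def seasonal_naive(history, period):
    """Forecast the next point as the observation one full period ago."""
    return history[-period]

period = 4
trace = [10, 50, 30, 20, 10, 50, 30, 20, 10, 50, 30]

# Predicting the next step of a repeating workload works well:
forecast = seasonal_naive(trace, period)   # forecasts 20, matching the pattern

# But if the next observation is an unexpected spike, the error is large:
actual_next = 500
print(forecast)                  # → 20
print(actual_next - forecast)    # → 480: the spike is entirely missed
```

This is exactly the gap the research aims to close: a model that stays accurate on seasonal workloads while also reacting to non-seasonal, spiky demand at acceptable computing cost.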





Target

  • An efficient resource demand prediction model (covering CPU, memory, disk, network bandwidth, etc.) for various workload patterns, balancing prediction accuracy against computing cost.



  • A systematic approach to detect shared resource interference between co-located workloads and generate schedules that avoid problematic co-locations.



  • Publication of 1-2 papers at CCF-A or other top conferences and journals in the field recognised by Alibaba.

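The second target above, detecting shared-resource interference, can be sketched minimally as comparing an online service's tail latency under co-location against its solo-run baseline and flagging pairings that exceed a tolerance. The threshold, task names, and measurements below are illustrative assumptions, not any specific scheduler's policy.

```python
# Sketch: flag problematic co-locations by comparing an online service's
# p99 latency when co-located against its solo-run baseline.
# The 10% tolerance, task names, and measurements are illustrative.

def interferes(baseline_p99_ms, colocated_p99_ms, tolerance=0.10):
    """True if co-location degrades tail latency beyond the tolerance."""
    return colocated_p99_ms > baseline_p99_ms * (1 + tolerance)

# Baseline p99 latency of the online service running alone: 80 ms.
# Measured p99 when sharing a node with each candidate offline task:
candidates = {"spark-etl": 95.0, "log-compact": 84.0, "ml-train": 130.0}
bad = [task for task, p99 in candidates.items() if interferes(80.0, p99)]
print(bad)   # tasks the scheduler should avoid co-locating with this service
```

A cluster-level scheduler would feed such verdicts back into placement decisions, steering flagged offline tasks away from nodes hosting the affected online services.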

Related Research Topics

  • Container-based workload co-location technology



  • Workload resource demand prediction method in cloud computing



  • Time series based prediction model



  • Interference-aware cluster management technology

