MaxCompute is the major big data platform of Alibaba Group and executes most of its computing tasks. MaxCompute 2.0, which rolled out the New SQL language, unstructured data handling, cost-based optimization, DAG-based task scheduling, and major performance improvements, is a major breakthrough for the big data platform from both a technical and a business perspective.
Currently, MaxCompute processes millions of tasks and hundreds of petabytes of data, a scale that already poses significant challenges. Business demand keeps growing while computing resources are limited. A further challenge is supporting different computing models on one big data platform, such as batch jobs with and without SLA constraints, interactive jobs, and ad-hoc jobs, with globally optimized scheduling.
Making better use of resources on a large-scale platform requires us to:
- Gain a fine-grained understanding of resource consumption, data distribution, network communication, and job scheduling.
- Move from local optimization to global resource optimization.
- Schedule task submission and execution more accurately and reliably.
- Leverage the cost-based optimizer (CBO), which currently focuses on local optimization at the single-query (job) level and has no cross-job or global optimization capabilities.
- Improve on HBO (History-Based Optimization), which can optimize job resource usage based on statistics of repetitive historical jobs. However, its algorithm is rule-based, so it has limited coverage and is sensitive to data changes.
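To make the limitations of a rule-based approach concrete, the following is a minimal sketch of what such a rule might look like. It is not MaxCompute's actual HBO implementation: the job signature, the constants, and the function `rule_based_memory_request` are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# All names and constants below are hypothetical, not MaxCompute APIs.
DEFAULT_MEMORY_MB = 4096.0  # assumed fallback request for jobs without enough history
MIN_RUNS = 3                # assumed minimum number of past runs before the rule applies
HEADROOM = 1.2              # assumed safety margin over the observed peak


@dataclass
class JobRun:
    signature: str         # identifies a repetitive job, e.g. a hash of its query text
    peak_memory_mb: float  # peak memory observed during that run


def rule_based_memory_request(signature: str, history: Dict[str, List[JobRun]]) -> float:
    """Return a memory request from a fixed rule over the most recent runs."""
    runs = history.get(signature, [])
    if len(runs) < MIN_RUNS:
        # Limited coverage: new or rarely repeated jobs get no benefit.
        return DEFAULT_MEMORY_MB
    # The rule only looks at past peaks, so a sudden change in input data
    # is not reflected until new runs have been recorded.
    recent_peak = max(run.peak_memory_mb for run in runs[-MIN_RUNS:])
    return recent_peak * HEADROOM


# Made-up history for one repetitive job.
history = {
    "daily_report": [JobRun("daily_report", m) for m in (1000.0, 1100.0, 1050.0)]
}
print(rule_based_memory_request("daily_report", history))     # based on past peaks
print(rule_based_memory_request("new_adhoc_query", history))  # falls back to the default
```

In this sketch, a job with no history falls back to the default, and a job whose input suddenly doubles would still receive a request sized for its old runs. These correspond to the coverage and data-sensitivity issues noted above.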
Through this cooperation, we expect to achieve more accurate and smarter job resource prediction and resource scheduling, which can significantly improve the throughput of the big data platform and provide insights for the next generation of big data platforms.
MaxCompute keeps full statistics for all jobs, which are a valuable resource for data mining and even for incorporating AI technologies into research on improving job prediction, dynamic/global resource optimization, etc. Mining and learning from historical job statistics opens up optimization opportunities beyond the output of CBO and the current HBO; we call this approach iHBO.
- Based on historical job statistics, leverage data mining and machine learning algorithms to better estimate data distribution and provide useful feedback to CBO, so that it generates better execution plans.
- Building on the current MaxCompute HBO and leveraging historical job statistics, generate better job resource management and scheduling schemes for resource optimization.
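As a rough illustration of how learning from historical statistics could differ from a fixed rule, the sketch below fits a simple least-squares line relating a job's input size to its peak memory, then extrapolates to a larger input. This is only an assumed toy model, not the iHBO design: the feature (input size), the data points, and the function names are all made up for the example.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All inputs identical: fall back to predicting the mean.
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def predict_memory_mb(input_gb: float, slope: float, intercept: float) -> float:
    """Predict peak memory for a run with the given input size."""
    return slope * input_gb + intercept


# Made-up historical runs of one repetitive job (illustrative values only).
input_gb = [10.0, 20.0, 30.0, 40.0]
peak_mb = [1200.0, 2200.0, 3200.0, 4200.0]

slope, intercept = fit_line(input_gb, peak_mb)
# Unlike a rule based only on past peaks, the prediction follows the input size.
print(predict_memory_mb(80.0, slope, intercept))
```

A real system would presumably need richer features, more robust models, and safeguards against under-provisioning. The point here is only that a learned relationship can react to a change in the data, where a rule over past peaks cannot.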
Expected outputs are:
- Provide useful insights for the design and implementation of the next-generation big data platform.
- Improve the throughput of the current big data platform by 20%.
- Improve the stability and efficiency of key jobs, reducing the out-of-SLA rate by a factor of 3 to 4.