Query Optimization Architecture for Big Data

Research Themes

Next Generation Intelligent Data Processing Platforms

Background

Traditional query optimizers, despite their long research history, still fall short in big data scenarios. Consider the existing open source optimizers: Calcite is still based on the Volcano framework, while ORCA, used in the PostgreSQL-based Greenplum, gives little consideration to big data scenarios and remains a more traditional database optimization engine. The MaxCompute SQL Optimizer is built on Calcite and therefore still follows the Volcano model. We need to evolve our query optimizer framework into a more efficient Cascades model suited to big data scenarios.

Target

Algorithm prototype: Complete the design and prototype verification of a Cascades query optimization framework for big data scenarios.

Papers: Publish 1-2 papers in CCF-A venues recognized by Alibaba, or in top conferences and journals in the field.

Technical indicators:

• With the new query optimization framework, query optimization efficiency improves by 50%

• In some MaxCompute production scenarios, the new query optimization framework and cost model reduce query execution resources by 20%

Related Research Topics

Building a Cascades-model query optimizer for big data scenarios requires accounting for several factors unique to that setting. The question of how to efficiently find a better execution plan therefore opens up many new research opportunities:

• The optimizer must account for the characteristics of the DAG execution engine in big data scenarios, including data partitioning, and integrate these factors into the cost model.

• The optimizer must account for the dynamic nature of big data scenarios, including UDFs, semi-structured data, partitioning, clustering and other characteristics. This calls for more robust query optimization and close interaction with the execution engine (see the related work on progressive optimizers).

• Because plan search with Volcano is an NP-hard problem, optimization time can explode, which makes it unsuitable for large queries in big data scenarios. The keys to the optimizer research framework are how to combine bottom-up use of interesting properties with top-down guided search during optimization, how to build a Cascades optimization model, and how to improve the efficiency of the optimization process so that optimization time drops substantially.
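The top-down, memoized search mentioned in the last point can be illustrated with a minimal sketch. It is not the MaxCompute or Calcite implementation. The table names, row counts, single join selectivity, and the cost function (sum of intermediate result sizes) are all assumptions chosen for illustration, and DAG stages, partitioning and physical properties are left out. Each set of tables plays the role of a memo group that is optimized once, and a branch is skipped as soon as its left input alone costs at least as much as the best plan found so far.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical base-table row counts and one assumed selectivity per join.
ROWS = {"orders": 1_000_000, "customers": 50_000, "nation": 25}
SELECTIVITY = 1e-4


def cardinality(tables):
    """Toy estimate: product of row counts, scaled by selectivity per join."""
    rows = 1.0
    for table in tables:
        rows *= ROWS[table]
    return rows * SELECTIVITY ** (len(tables) - 1)


@lru_cache(maxsize=None)  # the cache acts as the memo: one entry per group
def best_plan(tables):
    """Return (cost, plan) for joining the given frozenset of tables."""
    if len(tables) == 1:
        (table,) = tables
        return float(ROWS[table]), table  # cost of a scan = rows read

    best_cost, best_tree = float("inf"), None
    members = sorted(tables)
    for size in range(1, len(members)):
        for left in combinations(members, size):
            left = frozenset(left)
            right = tables - left
            left_cost, left_tree = best_plan(left)
            if left_cost >= best_cost:
                continue  # prune: this branch cannot beat the current best
            right_cost, right_tree = best_plan(right)
            cost = left_cost + right_cost + cardinality(tables)
            if cost < best_cost:
                best_cost, best_tree = cost, (left_tree, right_tree)
    return best_cost, best_tree


cost, plan = best_plan(frozenset(ROWS))
print(f"estimated cost: {cost:,.0f}")
print(f"join order: {plan}")
```

With these assumed statistics the search joins the two small tables first and brings in the large table last. A real Cascades optimizer would additionally track required physical properties per group (for example, a partitioning key) and apply transformation rules rather than enumerating every split.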
