Next Generation Intelligent Data Processing Platforms
Modern distributed clusters host a wide spectrum of data-processing workloads, with input data varying in type and size, and data transformations exhibiting distinct characteristics from job to job. On production systems, the applicability of execution plans pre-compiled before job submission can be challenged by the practical uncertainties that underlie real-world data processing. In particular, for complex computations, the characteristics of intermediate data after multiple transformations may be far too complicated for even a well-designed query optimizer to reason about. In addition, the varying characteristics of shuffle data, together with the common presence of user-defined functions (UDFs), introduce further uncertainties into query execution, which may further undermine the optimality of a pre-compiled query execution plan. This proposal seeks answers to such practical challenges inherent in large-scale data processing.
- A dynamic optimization framework that leverages interactions between the query optimizer and the distributed execution engine to ensure the (dynamic) selection of optimal execution plans throughout the lifetime of a distributed job execution. On top of an execution engine capable of full-fledged dynamic graph reconfiguration at runtime, we seek to enable a cost-based optimizer to re-optimize and adjust query plans on the fly, based on real-time statistics collected from partially executed subgraphs.
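The re-optimization trigger described above can be sketched as follows. This is a minimal illustrative sketch, not an existing engine API: the names `Stage`, `est_rows`, and `reoptimize`, and the deviation threshold, are all assumptions made for exposition.

```python
# Hypothetical sketch of a runtime re-optimization trigger: re-plan the
# remaining stages when observed statistics from finished subgraphs deviate
# too far from the optimizer's compile-time estimates.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    est_rows: int          # compile-time cardinality estimate
    done: bool = False
    actual_rows: int = 0   # observed cardinality once the stage finishes

def reoptimize(plan, threshold=2.0):
    """Return the finished stages whose actual cardinality deviates from
    the estimate by more than `threshold`x in either direction."""
    triggers = []
    for stage in plan:
        if stage.done and stage.est_rows > 0:
            ratio = stage.actual_rows / stage.est_rows
            if ratio > threshold or ratio < 1.0 / threshold:
                triggers.append(stage.name)
    return triggers

plan = [Stage("scan", est_rows=1_000_000, done=True, actual_rows=900_000),
        Stage("udf_filter", est_rows=500_000, done=True, actual_rows=5_000),
        Stage("join", est_rows=200_000)]
print(reoptimize(plan))  # the UDF filter's estimate was off by 100x
```

A real optimizer would of course feed the corrected statistics back into full cost-based plan search rather than merely flagging the deviation; the sketch only shows where runtime feedback enters the loop.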
- A thorough understanding of how the theoretical optimality of a new execution plan should be reconciled with the plan delta produced at query runtime.
- In a cost-based optimization framework, how to integrate the (practical) cost of distributed execution into the existing cost model; in particular, when the output of an already-executed subgraph should be discarded in exchange for an overall more cost-efficient new plan.
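The discard-or-keep decision above reduces to a sunk-cost comparison, which can be sketched as below. The cost units and the two helper functions are illustrative assumptions, not a real cost model.

```python
# Illustrative sketch: once a subgraph has executed, its cost is sunk, so a
# mid-execution plan switch should compare only the *remaining* cost of the
# current plan against the cost of the new plan net of whatever
# already-materialized subgraphs the new plan can reuse.
def remaining_cost_keep(old_plan_remaining):
    # Cost still to be paid if we stay on the current plan.
    return old_plan_remaining

def remaining_cost_switch(new_plan_total, reusable_subgraph_cost):
    # The new plan must redo everything except the subgraphs it reuses.
    return new_plan_total - reusable_subgraph_cost

old_remaining = 80   # cost units left on the current plan (assumed)
new_total = 100      # full cost of the candidate plan (assumed)
reusable = 40        # cost of materialized subgraphs the new plan reuses

switch_cost = remaining_cost_switch(new_total, reusable)
decision = "switch" if switch_cost < remaining_cost_keep(old_remaining) else "keep"
print(decision)  # prints "switch": 60 < 80, despite discarding some output
```

The point of the example is that a new plan can win even when part of the already-executed work is thrown away, provided its net remaining cost is lower.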
- How data statistics should be collected throughout the lifecycle of a distributed execution to support the wide range of decisions that plan adjustment may involve. This includes determining the cost of generating, collecting, transporting, and aggregating these statistics.
- The capabilities the underlying execution framework should provide to support efficient plan adjustment, including reusing intermediate data to the maximum possible extent, and abstracting and extracting equivalent execution subgraphs across different plans.
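One common way to detect equivalent subgraphs across plans is structural fingerprinting, sketched below. The operator strings and the `fingerprint` helper are hypothetical; a real system would canonicalize operator properties (partitioning, sort order, expression normalization) far more carefully before hashing.

```python
# Hypothetical sketch: identify reusable subgraphs across plans by hashing
# each operator together with its children's fingerprints, so structurally
# identical subtrees share a fingerprint and their outputs can be reused.
import hashlib

def fingerprint(op, child_prints):
    """Fingerprint an operator node from its description plus the (order-
    insensitive) fingerprints of its children."""
    payload = op + "|" + ",".join(sorted(child_prints))
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Old plan: scan(t1) -> filter -> join with scan(t2).
scan_t1 = fingerprint("scan(t1)", [])
filt = fingerprint("filter(a>0)", [scan_t1])
old_join = fingerprint("hash_join", [filt, fingerprint("scan(t2)", [])])

# New plan rebuilds the same filtered scan; the fingerprints match, so the
# materialized output of the old filter subtree can be reused as-is.
new_filt = fingerprint("filter(a>0)", [fingerprint("scan(t1)", [])])
print(new_filt == filt)  # prints True
```

Matching fingerprints only establish structural equivalence under the chosen canonicalization; the execution framework must still verify that the old subtree's output was fully and durably materialized before substituting it.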
Related Research Topics
- Technical solutions, built atop the Alibaba MaxCompute platform, that support dynamic adjustment of execution plans through real-time interactions between the query optimizer and the execution framework, to facilitate optimal execution across a wide range of distributed workloads.
- Through the establishment of a general framework that allows interactions between different components, enhanced ability of the system to adapt to the uncertainties and complex variations of real-world executions. For cases that are particularly challenging for pre-compiled plans, such as the presence of UDFs and/or large fluctuations in data characteristics after multiple transformations, we expect significant improvements in workload performance.
- General applicability of dynamic optimization, with no noticeable performance regression in plain cases where plan re-adjustment is unnecessary.
- 1-2 publications at top-tier systems/database conferences, including but not limited to SIGMOD, VLDB, OSDI, and ICDE.