Next Generation Intelligent Data Processing Platforms
Early distributed relational databases were used mainly for transaction processing (TP) and typically adopted row storage, which suits transactions, row locks, and similar operations. Data then had to be synchronized to a separate analytical database for analytical processing (AP), which is read-intensive and usually column-oriented. This two-system architecture has obvious drawbacks: 1. Two different database products must be operated and maintained, which drives up operation and maintenance costs. 2. Cumbersome and expensive ETL operations are required, and the synchronization delay they introduce makes it impossible to analyze the latest data. Therefore, an important direction for distributed databases today is the ability to process both online transactions and online analysis in one system, known as Hybrid Transactional and Analytical Processing (HTAP). HTAP databases generally adopt a hybrid storage scheme that keeps both row and column storage, so that transactional and analytical workloads are both served well. Strongly consistent data synchronization between row storage and column storage is achieved with consensus protocols such as Paxos. Hybrid storage places higher demands on existing query optimization. We hope to establish a model suited to hybrid storage that improves our query optimization capabilities, satisfying both high-concurrency TP business requirements and the batch processing needs of AP-class businesses.
Algorithm Prototype: Complete the design of the xxxx algorithm and deliver a set of xxxxx source code.
Paper: Publish 1-2 papers at Ali-approved CCF-A venues or other top-level conferences/journals in the field.
The technical indicators can be referenced as follows:
Algorithm performance: compared with the current level, performance is improved from xx% (current level) to xx%;
Cost model accuracy: the cost model is tested in typical business scenarios and should produce a suitable execution plan; accuracy is improved from xx% (current level) to xx%;
Cache architecture advancement: design HTAP mixed business scenarios and make horizontal comparisons with industry products in terms of performance, stability, and architecture.
Related Research Topics
Given the above background, we hope to achieve breakthrough innovations on the following technical issues:
- Data structures: study how to work efficiently with the existing row-based and column-based data structures, so that query processing avoids the overhead of format conversion;
- Design and use appropriate indexes, so that index selection serves both row-based and column-based queries;
- Design more efficient row-column hybrid operators, focusing mainly on the Agg/Join/Sort operators. For example, in HashJoin, the build side can construct its hash table from row storage while the probe side scans column storage;
- Establish a cost model that selects row-based or column-based storage and the related indexes for different workloads, and construct better plans that exploit row-column hybrid storage;
- Since distributed databases now adopt an architecture that separates storage from computation, study how to build a cache acceleration layer that suits both TP and AP scenarios and improves query performance.
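To make the HashJoin example in the list above concrete, here is a minimal sketch of a hybrid join in which the build side reads row-oriented data and the probe side scans only the columns it needs. The table contents, the function name `hybrid_hash_join`, and the in-memory layouts (a list of tuples for row storage, a dict of lists for column storage) are illustrative assumptions, not part of any existing system.

```python
from collections import defaultdict

# Hypothetical row-store table: each tuple is one complete row (customer_id, name).
row_customers = [
    (1, "alice"),
    (2, "bob"),
    (3, "carol"),
]

# Hypothetical column-store table: each column is stored as its own list.
col_orders = {
    "customer_id": [1, 3, 3, 4],
    "amount": [10.0, 25.0, 5.0, 99.0],
}


def hybrid_hash_join(build_rows, build_key_idx, probe_cols, probe_key, probe_payload):
    """Join row-oriented build input with column-oriented probe input."""
    # Build phase: rows arrive whole, so hash each row on its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key_idx]].append(row)

    # Probe phase: touch only the join-key column and the one payload column,
    # leaving every other column of the probe table unread.
    keys = probe_cols[probe_key]
    payload = probe_cols[probe_payload]
    joined = []
    for i, key in enumerate(keys):
        for row in table.get(key, ()):
            joined.append(row + (payload[i],))
    return joined


result = hybrid_hash_join(row_customers, 0, col_orders, "customer_id", "amount")
for r in result:
    print(r)
```

In this sketch no format conversion happens on either side: the build phase consumes rows as stored, and the probe phase reads two column arrays directly. A real operator would also need to handle spilling, vectorized probing, and null keys, which are omitted here.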
Overall, row-column hybrid storage leaves many technical points in query optimization that can be addressed.
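As an illustration of the cost-model topic listed above, the following sketch picks row or column storage for a single-table access by comparing two rough cost estimates. The cost constants, the `TableStats` and `Query` types, and the formulas themselves are hypothetical placeholders chosen for readability; a real model would be calibrated against measured workloads.

```python
from dataclasses import dataclass

# Hypothetical per-unit costs; real values would come from calibration.
ROW_CELL_COST = 1.0     # reading one cell from row storage
COL_CELL_COST = 0.2     # reading one cell in a column scan (assumes compression/vectorization)
INDEX_PROBE_COST = 4.0  # locating one qualifying row through a row-store index


@dataclass
class TableStats:
    row_count: int
    column_count: int


@dataclass
class Query:
    columns_read: int   # number of columns the query actually needs
    selectivity: float  # fraction of rows that satisfy the predicate
    has_index: bool     # whether a usable row-store index exists


def row_store_cost(stats: TableStats, q: Query) -> float:
    # Row storage always reads whole rows; an index limits how many rows are read.
    if q.has_index:
        matched = stats.row_count * q.selectivity
        return matched * (INDEX_PROBE_COST + stats.column_count * ROW_CELL_COST)
    return stats.row_count * stats.column_count * ROW_CELL_COST


def column_store_cost(stats: TableStats, q: Query) -> float:
    # Column storage scans every row, but only the columns the query needs.
    return stats.row_count * q.columns_read * COL_CELL_COST


def choose_store(stats: TableStats, q: Query) -> str:
    return "row" if row_store_cost(stats, q) <= column_store_cost(stats, q) else "column"


stats = TableStats(row_count=1_000_000, column_count=20)
point_lookup = Query(columns_read=20, selectivity=1e-6, has_index=True)
wide_scan = Query(columns_read=2, selectivity=0.5, has_index=True)

print(choose_store(stats, point_lookup))
print(choose_store(stats, wide_scan))
```

Under these assumed constants, the highly selective indexed lookup is cheaper on row storage, while the scan that reads two of twenty columns over half the table is cheaper on column storage. That matches the TP/AP split described in the background.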