Big Data & Data Mining
High-Performance Index for Semi-Structured Data Types in Hybrid Serving/Analytics Processing
In big data hybrid serving/analytics processing (HSAP) scenarios, represented by Alicloud’s Hologres interactive analysis engine, complex semi-structured data (such as JSON, XML, MAP, GIS, and vector types) is an important data category. With such data types, users can achieve schema-free data processing, which is quite common in HSAP-based real-time data warehouses. For example, some internet companies unify all their tracking data in one format, with JSON as the payload.
A good index is essential for processing (analyzing and serving) such data efficiently. However, an index for such data types differs from a traditional index in two respects: the data types themselves, and the distributed nature and performance requirements of HSAP. On the data type side, the index must cover not only sub-item keys and values but also paths and similar structural information. On the HSAP side, the problem is how to balance ingest and query performance against index efficiency.
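To make the data type side concrete, the following minimal sketch flattens one JSON document into the kinds of entries such an index would need to cover: sub-item keys, values, and the paths that lead to them. It is illustrative only and does not describe how Hologres stores or builds indexes; the function name `index_terms`, the `(kind, path, term)` tuple layout, the `$`-rooted path notation, and the sample event are all assumptions made for this example.

```python
import json
from typing import Any, Iterator, Tuple


def index_terms(value: Any, path: str = "$") -> Iterator[Tuple[str, str, Any]]:
    """Yield hypothetical (kind, path, term) index entries for a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            # One entry records that this key exists at this path.
            yield ("key", child_path, key)
            yield from index_terms(child, child_path)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            # Array elements are addressed by position in the path.
            yield from index_terms(child, f"{path}[{i}]")
    else:
        # Scalars (str, number, bool, null) become value entries.
        yield ("value", path, value)


# Hypothetical tracking event with JSON as the payload.
event = json.loads('{"user": {"id": 7, "tags": ["a", "b"]}, "type": "click"}')
terms = list(index_terms(event))
for term in terms:
    print(term)
```

Even this small event produces eight entries, which hints at why index size and ingest cost have to be weighed against query speed.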
Hologres, Alicloud’s big data real-time data warehouse, now supports efficient storage and processing of structured data. We want to enhance it with strong semi-structured data index support to better serve serving and analytics scenarios, which is a must-have for HSAP.
Design and implement a high-performance index solution in Hologres for semi-structured data types (JSON, XML, MAP, GIS, and vector data types) in HSAP scenarios. The solution shall meet the following requirements:
- Relatively small inflation ratio (<2X inflation).
- Minimal impact on real-time ingest performance (<10% performance loss at an ingest speed of millions of records per second).
- Strong expressive capability: support indexing for sub-item key, value, path, and similarity retrieval, and allow users to flexibly configure one or several such indexes.
- High performance for index-based queries. For example, an existence lookup for a sub-item key shall perform similarly to an existence lookup on a normal text column.
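The two quantitative targets in the list can be written down as simple checks. The sketch below is one possible reading, not a definition from the proposal: it assumes that the inflation ratio means total stored bytes (raw data plus index) divided by raw data bytes, and that ingest loss means the relative drop in records per second once indexing is enabled. The function names and the sample figures are hypothetical.

```python
def inflation_ratio(raw_bytes: int, index_bytes: int) -> float:
    """Assumed definition: (raw data + index) / raw data."""
    return (raw_bytes + index_bytes) / raw_bytes


def ingest_loss(baseline_rps: float, indexed_rps: float) -> float:
    """Assumed definition: relative throughput drop with the index enabled."""
    return (baseline_rps - indexed_rps) / baseline_rps


def meets_targets(raw_bytes: int, index_bytes: int,
                  baseline_rps: float, indexed_rps: float) -> bool:
    """Check the <2X inflation and <10% ingest loss targets together."""
    return (inflation_ratio(raw_bytes, index_bytes) < 2.0
            and ingest_loss(baseline_rps, indexed_rps) < 0.10)


# Hypothetical measurements, for illustration only.
GB = 1024 ** 3
ok = meets_targets(raw_bytes=100 * GB, index_bytes=60 * GB,
                   baseline_rps=2_000_000, indexed_rps=1_850_000)
print(ok)
```

If the inflation ratio is instead meant as index bytes over raw bytes alone, only `inflation_ratio` needs to change.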
This work shall be landed in Alicloud’s Hologres product, and at least one CCF A-class paper shall be submitted for it.
Related Research Topics
Indexing semi-structured data is a relatively complex field for traditional databases. For example, SQL Server provides some limited support via computed columns, while Oracle has relatively comprehensive support for these data types. However, there is little research on this topic in the big data field, and even less on HSAP real-time data warehouses, which impose stricter requirements on such indexes.
Suggested Collaboration Method
AIR (Alibaba Innovative Research), one-year collaboration project.