Data Deduplication for Database Backup Service and Snapshots Based on Cloud

Background

We are in the age of cloud computing, in which computing, network, and storage resources are supplied as basic infrastructure to hundreds of millions of users. Making it easy for users to enjoy the benefits of the cloud is a problem we urgently need to solve.

Many companies choose to back up their data to the cloud for remote disaster recovery, to take advantage of low-cost, highly reliable cloud storage, and to recover databases quickly using Cloud Relational Database Service.

An efficient data deduplication storage engine that supports file reading and point-in-time snapshots is highly valuable in database backup scenarios. Because databases are backed up periodically, successive backups contain much of the same data, so a deduplication storage engine can greatly reduce backup cost. Snapshot capability helps users restore a database quickly and supports Change Data Capture backup. File reads let users run data lake queries directly on backup sets.
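
As a minimal sketch of how these three capabilities fit together, the Python below keeps each unique chunk once and records every snapshot as a per-file list of chunk fingerprints. Everything here is an assumption for illustration: the `DedupStore` class, its method names, the in-memory dictionaries standing in for OSS objects, and the tiny 4-byte chunk size are hypothetical and not part of any existing service.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory stand-in for a deduplicated backup store."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}     # fingerprint -> chunk bytes, each stored once
        self.snapshots = {}  # snapshot id -> {file name: [fingerprints]}

    def backup(self, snapshot_id, files):
        """Record a point-in-time snapshot of a {name: bytes} mapping."""
        recipe = {}
        for name, data in files.items():
            fps = []
            for i in range(0, len(data), self.chunk_size):
                chunk = data[i:i + self.chunk_size]
                fp = hashlib.sha256(chunk).hexdigest()
                # Only chunks not seen before consume new storage.
                self.chunks.setdefault(fp, chunk)
                fps.append(fp)
            recipe[name] = fps
        self.snapshots[snapshot_id] = recipe

    def read_file(self, snapshot_id, name):
        """Rebuild one file as it existed in the given snapshot."""
        fps = self.snapshots[snapshot_id][name]
        return b"".join(self.chunks[fp] for fp in fps)

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore(chunk_size=4)
store.backup("t1", {"users.tbl": b"AAAABBBBCCCC"})
store.backup("t2", {"users.tbl": b"AAAABBBBDDDD"})  # only the last chunk changed

print(store.read_file("t1", "users.tbl"))  # file as of the first backup
print(store.stored_bytes())                # unique bytes kept across both backups
```

Two 12-byte backups occupy 16 bytes here because the unchanged chunks are shared, and either snapshot can still be read back file by file.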

The database backup service is cloud-based: it uses the existing distributed object storage service OSS as the underlying storage, and the data deduplication engine runs on ECS virtual machines, reading and writing data on OSS to implement asynchronous deduplication. How to optimize the chunk size to achieve a high deduplication ratio, and how to implement data fingerprinting to improve deduplication efficiency, are very challenging problems. The industry uses fixed-length chunking, variable-length chunking, or a combination of the two. We need to choose the best chunk size automatically based on the characteristics of the data.
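
To make the chunking trade-off concrete, the sketch below compares fixed-length chunking with a simple variable-length (content-defined) chunker and fingerprints each chunk with SHA-256. It is an illustration under stated assumptions, not the proposed engine: the gear-style rolling hash, the `GEAR` table, the function names, and all size parameters are hypothetical choices made for this example.

```python
import hashlib
import random

# Hypothetical per-byte table for a gear-style rolling hash (derived from
# SHA-256 only so that the example is deterministic).
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big")
        for b in range(256)]
MASK64 = (1 << 64) - 1


def fixed_chunks(data, size=4096):
    """Fixed-length chunking: cut every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data, avg_bits=10, min_size=256, max_size=8192):
    """Variable-length chunking: cut where the rolling hash matches a mask."""
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The 64-bit shift drops bytes older than 64 positions, so the cut
        # decision depends only on nearby content, not on the file offset.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        at_boundary = length >= min_size and ((h >> 16) & mask) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fingerprints(chunks):
    """Fingerprint each chunk; equal fingerprints mark duplicate chunks."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


rng = random.Random(42)
base = bytes(rng.getrandbits(8) for _ in range(128 * 1024))
edited = b"XYZ" + base  # a 3-byte insertion at the front shifts every offset

fixed_shared = len(fingerprints(fixed_chunks(base)) & fingerprints(fixed_chunks(edited)))
cdc_shared = len(fingerprints(cdc_chunks(base)) & fingerprints(cdc_chunks(edited)))
print("chunks shared after the edit: fixed =", fixed_shared, "cdc =", cdc_shared)
```

Fixed-length chunking is cheaper to compute but loses its duplicates when data shifts, while content-defined boundaries realign after an insertion. An automatic chooser could run a comparison of this kind on a sample of each data set before picking a chunking strategy and size.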

We hope that an efficient data deduplication storage engine will enable all users to back up at low cost and recover quickly.

Target

  • Design an efficient data deduplication engine based on distributed object storage OSS on the cloud, supporting file reading and point-in-time snapshots.
  • In database backup scenarios, save more than 60% of the backup storage space while recovery speed drops by no more than 10%.
  • Academic papers and patents.
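
One possible reading of the two numeric targets above is sketched below. The definitions are assumptions: space saving is taken as the fraction of logical backup bytes that no longer needs to be stored, and the speed drop as the relative loss in restore throughput against a non-deduplicated baseline. The function names and the sample figures are hypothetical and are not measurements.

```python
def space_saving(logical_bytes, stored_bytes):
    """Fraction of backup storage saved by deduplication."""
    return 1 - stored_bytes / logical_bytes


def speed_drop(baseline_mb_s, dedup_mb_s):
    """Relative loss in restore throughput versus the baseline."""
    return 1 - dedup_mb_s / baseline_mb_s


def meets_targets(logical_bytes, stored_bytes, baseline_mb_s, dedup_mb_s):
    """Check the assumed targets: saving above 60%, speed drop at most 10%."""
    saving = space_saving(logical_bytes, stored_bytes)
    drop = speed_drop(baseline_mb_s, dedup_mb_s)
    return saving > 0.60 and drop <= 0.10


# Hypothetical figures: 1000 GB of logical backups stored in 350 GB,
# restore throughput of 185 MB/s against a 200 MB/s baseline.
print(meets_targets(1000, 350, 200, 185))
```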

Related Research Topics

  • Two-side data deduplication mechanism for non-center cloud storage systems.
  • NF-Dedupe: A novel no-fingerprint deduplication scheme for flash-based SSDs. 
  • AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment.
