Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2022/07/27 07:44:57 UTC
[GitHub] [incubator-pegasus] hycdong opened a new issue, #1081: Feature: enhance cold backup and restore function

hycdong opened a new issue, #1081:
URL: https://github.com/apache/incubator-pegasus/issues/1081

   # Background
   Pegasus currently supports cold backup and restore functions, but both of them have some disadvantages. 
   
   For cold backup, Pegasus supports periodic backup through policies. Users can create a policy with backup-related parameters, such as the provider and the interval, and apply it to several tables. Pegasus has also supported onetime backup since release 2.3.0.
   However, the backup function has the following disadvantages:
   - Periodic backup does not start accurately at the configured start time.
   - When the periodic backup interval is less than 1 day, periodic backup is triggered unexpectedly.
   - A user-defined provider path is not supported for periodic backup.
   - Once a backup is started, it cannot be canceled. When a backup fails, it keeps retrying until it succeeds, even across meta server restarts.
   - The current backup incurs heavy I/O while creating checkpoints.
   - The path layout on the provider makes it hard to find one table's backups.
   - The backup code is hard to read and maintain.
   
   For restore, Pegasus supports two values of data_version: tables created in release 1.x are V0, and tables created in release 2.x are V1. The restore process creates an empty table and then applies the backup checkpoint. This causes a compatibility problem: a release 2.x table cannot apply a V0 checkpoint, and attempting it leads to a coredump that makes the cluster unusable. As a result, restore needs to check the table's data_version to be robust.
   
   # New backup design
   The enhanced version of backup, referred to here as backup v2, will solve all the problems above and provide a simple backup function.
   
   ## Components
   The meta backup function consists of three parts:
   - Backup engine - interacts with the replica servers
   - Periodic backup context - manages a table's periodic backup policy and its backups
     - the meta server has a timer to check whether a periodic backup should be triggered
     - for the first backup, the server checks against start_time, whose format is like "15:00"
     - for subsequent backups, the server compares the last backup start time with the periodic backup interval
     - a periodic backup policy cannot be modified, but it can be deleted and recreated
   - Backup service - manages the backups of all tables in the cluster, including onetime and periodic backups. It also exposes the RPC interface to admin-cli and the shell:
     - add table periodic backup policy
     - query periodic backup policy
     - disable/enable periodic backup policy
     - delete periodic backup policy
     - start onetime backup
     - query backup (onetime and periodic)
     - cancel backup (onetime and periodic)
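
The trigger rule described for the periodic backup context can be sketched as follows. This is only an illustration, not the Pegasus implementation: the function name `should_trigger`, its parameters, and the choice to fire the first backup as soon as the time of day reaches `start_time` are assumptions made for the example.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def should_trigger(now: datetime,
                   start_time: str,
                   interval: timedelta,
                   last_backup_start: Optional[datetime]) -> bool:
    """Hypothetical check that a meta-server timer could run periodically."""
    if last_backup_start is None:
        # First backup: compare the current time of day with start_time ("HH:MM").
        # Assumption: the backup fires once the clock reaches start_time.
        hour, minute = map(int, start_time.split(":"))
        return now.time() >= time(hour, minute)
    # Subsequent backups: compare the last start time with the interval.
    return now - last_backup_start >= interval


# Usage: a policy with start_time "15:00" and a 6-hour interval.
first = should_trigger(datetime(2022, 7, 27, 15, 0), "15:00", timedelta(hours=6), None)
later = should_trigger(datetime(2022, 7, 27, 20, 0), "15:00", timedelta(hours=6),
                       datetime(2022, 7, 27, 15, 0))
```

Comparing against the last backup start time, rather than against a fixed daily slot, is what lets intervals shorter than one day behave predictably in this sketch.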
    
   ## Main flow
   ![image](https://user-images.githubusercontent.com/17868458/181185836-6eea1905-eb29-4557-bb1b-421c2b12d3a4.png)
   - when receiving a start backup request, the engine turns its backup status into `checkpointing` and sends requests to the replica servers
   - each replica turns its state into `checkpointing`, and then into `checkpointed` once the checkpoint is generated successfully
   - when the status of all partitions is `checkpointed`, meta turns its status into `uploading`
   - each replica turns its state into `uploading`, and then into `succeed` once the checkpoint is uploaded successfully; the backup checkpoint directory is deleted after a while
   - when the status of all partitions is `succeed`, meta turns its status into `succeed` and considers the backup successful
   - if any error happens during the whole process, the backup fails
   - if a cancel backup request is received, a backup that is checkpointing or uploading is canceled
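
The meta-side flow above can be read as a small state machine. The sketch below is a simplified, in-memory model under stated assumptions: the class `BackupEngine`, the method names, and the status names `failed` and `canceled` are hypothetical (the proposal only names `checkpointing`, `checkpointed`, `uploading` and `succeed`), and real RPCs between meta and replica servers are replaced by direct method calls.

```python
from enum import Enum


class Status(Enum):
    CHECKPOINTING = "checkpointing"
    CHECKPOINTED = "checkpointed"
    UPLOADING = "uploading"
    SUCCEED = "succeed"
    FAILED = "failed"      # assumed name, not given in the proposal
    CANCELED = "canceled"  # assumed name, not given in the proposal


class BackupEngine:
    """Hypothetical meta-side model of one backup over all partitions."""

    def __init__(self, partition_count: int) -> None:
        self.status = Status.CHECKPOINTING
        self.partitions = [Status.CHECKPOINTING] * partition_count

    def on_partition_report(self, pidx: int, status: Status) -> None:
        # Ignore reports once the backup has reached a final state.
        if self.status in (Status.SUCCEED, Status.FAILED, Status.CANCELED):
            return
        # Any partition error fails the whole backup.
        if status is Status.FAILED:
            self.status = Status.FAILED
            return
        self.partitions[pidx] = status
        if self.status is Status.CHECKPOINTING and all(
                p is Status.CHECKPOINTED for p in self.partitions):
            # All checkpoints exist: move on and ask replicas to upload.
            self.status = Status.UPLOADING
            self.partitions = [Status.UPLOADING] * len(self.partitions)
        elif self.status is Status.UPLOADING and all(
                p is Status.SUCCEED for p in self.partitions):
            self.status = Status.SUCCEED

    def cancel(self) -> None:
        # Only an in-progress backup can be canceled.
        if self.status in (Status.CHECKPOINTING, Status.UPLOADING):
            self.status = Status.CANCELED


# Usage: a two-partition backup that completes normally.
engine = BackupEngine(partition_count=2)
for pidx in range(2):
    engine.on_partition_report(pidx, Status.CHECKPOINTED)
for pidx in range(2):
    engine.on_partition_report(pidx, Status.SUCCEED)
```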
   
   ## Backup paths
   ### Path on remote storage (zk)
   ```
   <cluster_root>/backup/<app_id>/once/<timestamp>/<backup_item>
   <cluster_root>/backup/<app_id>/periodic/<policy_context>
                                                  /<timestamp>/<backup_item>
   ```
   
   ### Path on remote backup provider (such as HDFS)
   ```
   <root>/<cluster_name>/<app_name>_<app_id>/<timestamp>/<pidx>/<checkpoint>
                                                                           /meta
                                                                           /backup_info
   ```
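
To show how this layout groups everything for one table under a single prefix, here is a minimal path-building sketch. The helper names and the sample values are hypothetical. Only the first line of the layout is reproduced; the exact parent directory of `meta` and `backup_info` is not modeled, since the listing above does not make it unambiguous.

```python
import posixpath


def table_backup_root(root: str, cluster_name: str, app_name: str, app_id: int) -> str:
    """Prefix under which all backups of one table would live (hypothetical helper)."""
    return posixpath.join(root, cluster_name, f"{app_name}_{app_id}")


def partition_checkpoint_path(root: str, cluster_name: str, app_name: str, app_id: int,
                              timestamp: int, pidx: int, checkpoint: str) -> str:
    """<root>/<cluster_name>/<app_name>_<app_id>/<timestamp>/<pidx>/<checkpoint>"""
    return posixpath.join(table_backup_root(root, cluster_name, app_name, app_id),
                          str(timestamp), str(pidx), checkpoint)


# Usage with made-up values; the timestamp unit is an assumption.
prefix = table_backup_root("/backup", "onebox", "temp", 2)
path = partition_checkpoint_path("/backup", "onebox", "temp", 2, 1658907897000, 0, "chkpt")
```

Because the table name and id come before the timestamp, listing the table prefix is enough to enumerate all of that table's backups.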
   
   # New restore
   Restore v2 does not change the design. It only adds the data version check, refactors the code, and stays compatible with the old backup path on the backup provider.
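
A data version check of this kind might look like the sketch below. It assumes that both the target table's data_version and the checkpoint's data_version are known before the checkpoint is applied, and it conservatively rejects any mismatch. The names `RestoreError` and `check_data_version` are hypothetical, and the proposal itself only states that a release 2.x (V1) table cannot apply a V0 checkpoint.

```python
class RestoreError(Exception):
    """Raised when a checkpoint must not be applied to the target table."""


def check_data_version(table_data_version: int, checkpoint_data_version: int) -> None:
    # Assumption: refuse any mismatch up front instead of letting the
    # replica crash while applying an incompatible checkpoint.
    if table_data_version != checkpoint_data_version:
        raise RestoreError(
            f"table data_version V{table_data_version} cannot apply "
            f"a V{checkpoint_data_version} checkpoint")


# Usage: a V1 table restoring a V0 checkpoint is rejected with an error.
try:
    check_data_version(table_data_version=1, checkpoint_data_version=0)
    rejected = False
except RestoreError:
    rejected = True
```

Failing the restore request with an error keeps the problem local to that request, rather than bringing down replica servers.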
   
   # Pull request merge plan
   - Add a new branch called `backup-restore-dev`. All pull requests will first be merged into this branch, and finally into the master branch.
   - Remove all old backup and restore code first, because the new code differs greatly from the old implementation.
   - This feature is NOT planned for 2.4.0 but for the next release, so it will not block the release process.
   

