Posted to issues@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2023/02/22 22:50:00 UTC
[jira] [Created] (HBASE-27659) Incremental backups should re-use splits from last full backup

Bryan Beaudreault created HBASE-27659:
-----------------------------------------

             Summary: Incremental backups should re-use splits from last full backup
                 Key: HBASE-27659
                 URL: https://issues.apache.org/jira/browse/HBASE-27659
             Project: HBase
          Issue Type: Improvement
            Reporter: Bryan Beaudreault


All incremental backups require a previous full backup. Full backups use snapshots + ExportSnapshot, which includes exporting the SnapshotManifest. The SnapshotManifest includes all of the regions in the table during the snapshot.

Incremental backups use WALPlayer to turn the new HLogs written since the last backup into HFiles. This uses HFileOutputFormat2, which writes HFiles along the split boundaries the table has at the time the job runs.

Active clusters may have regions split and merge over time, so the split boundaries of incremental backup hfiles may not align to the original full backup. This means we need to use MapReduceHFileSplitterJob during restore in order to read all of the hfiles for all of the incremental backups and re-split them based on the restored table.
 * So let's say a cluster with regions A, B, C does a full backup. Data in that backup will be segmented into those 3 regions.
 * Over time the cluster splits and merges and we end up with totally different regions D, E, F. An incremental backup occurs, and the data will be segmented into those 3 regions.
 * Later the cluster splits those 3 regions, so we end up with new regions G, H, I, J, K, L. The next incremental backup is segmented into those 6 regions.

When we go to restore this cluster, it'll pull the full backup and the 2 incrementals. The full backup will get restored first, so the new table will have regions A, B, C.  Then all of the hfiles from the incrementals will be combined together and run through MapReduceHFileSplitterJob. This will cause all of those data files to get re-partitioned based on the A, B, C regions of the newly restored table (based on the full backup).
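
To make the misalignment concrete, here is a minimal sketch in plain Java (no HBase dependencies). The class and method names are hypothetical, and String keys stand in for HBase's byte[] row keys, which are compared lexicographically. It checks whether an incremental hfile's key range fits inside a single region of the restored table; any file that does not is one the restore would have to re-split.

```java
import java.util.List;

/** Illustrative only: String keys stand in for HBase byte[] row keys. */
final class HFileAlignmentCheck {

    /**
     * Returns the index of the region containing rowKey.
     * Assumes startKeys is sorted and its first entry is "" (the first region's start key).
     */
    static int regionIndex(List<String> startKeys, String rowKey) {
        int idx = 0;
        for (int i = 0; i < startKeys.size(); i++) {
            if (startKeys.get(i).compareTo(rowKey) <= 0) {
                idx = i; // last start key that is <= rowKey wins
            }
        }
        return idx;
    }

    /** True if every key in [firstKey, lastKey] lands in the same region. */
    static boolean fitsInOneRegion(List<String> startKeys, String firstKey, String lastKey) {
        return regionIndex(startKeys, firstKey) == regionIndex(startKeys, lastKey);
    }

    public static void main(String[] args) {
        // Hypothetical start keys for the restored regions A, B, C.
        List<String> restored = List.of("", "h", "p");
        System.out.println(fitsInOneRegion(restored, "a", "g")); // file within region A
        System.out.println(fitsInOneRegion(restored, "e", "k")); // file straddling A and B
    }
}
```

With the made-up boundaries above, a file covering "e" through "k" straddles regions A and B, so it cannot be bulkloaded as-is.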

This splitting process is expensive on a large cluster. We could skip it entirely if incremental backups used the RegionInfos from the original full backup's SnapshotManifest as the splits for WALPlayer. All incremental backups would then use the same splits as the original full backup, and the resulting hfiles could be bulkloaded directly without any split process, reducing the cost and time of restore.
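
A rough sketch of the idea, again in plain Java with hypothetical names and String keys standing in for byte[] row keys: rows are routed by the start keys recorded at full-backup time rather than by the table's current regions. This is not the WALPlayer or HFileOutputFormat2 API, only an illustration of the partitioning that fixed splits would give.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: groups row keys by a fixed set of region start keys. */
final class FixedSplitPartitioner {

    /**
     * Partitions rowKeys by the start keys captured in the full backup.
     * Assumes fullBackupStartKeys contains "" for the first region.
     * Returns a map from region start key to the rows routed to that region.
     */
    static Map<String, List<String>> partition(List<String> fullBackupStartKeys,
                                               List<String> rowKeys) {
        if (!fullBackupStartKeys.contains("")) {
            throw new IllegalArgumentException("first region must start at the empty key");
        }
        TreeMap<String, List<String>> byRegion = new TreeMap<>();
        for (String startKey : fullBackupStartKeys) {
            byRegion.put(startKey, new ArrayList<>());
        }
        for (String row : rowKeys) {
            // floorEntry finds the greatest start key <= row, i.e. the owning region.
            byRegion.floorEntry(row).getValue().add(row);
        }
        return byRegion;
    }

    public static void main(String[] args) {
        // Hypothetical start keys of regions A, B, C from the full backup.
        List<String> fullBackupSplits = List.of("", "h", "p");
        // Rows from a later incremental, written when the live table had other regions.
        List<String> incrementalRows = List.of("a", "k", "z", "h");
        System.out.println(partition(fullBackupSplits, incrementalRows));
    }
}
```

Because every incremental would be partitioned by the same start keys, each output group falls entirely within one region of the restored table, regardless of how the live table was split when the incremental ran.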

One other benefit is that one could use the combination of a full backup + all incremental backups as input to their own mapreduce job. This is impossible now because the backups have HFiles with different start/end keys, which don't align to a common set of splits for combining into ClientSideRegionScanner.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)