Posted to issues@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2023/01/19 04:26:00 UTC
[jira] [Created] (HBASE-27579) CatalogJanitor can cause data loss due to errors during cleanMergeRegion
Bryan Beaudreault created HBASE-27579:
-----------------------------------------
Summary: CatalogJanitor can cause data loss due to errors during cleanMergeRegion
Key: HBASE-27579
URL: https://issues.apache.org/jira/browse/HBASE-27579
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault
In CatalogJanitor.cleanMergeRegion, there is the following check:
{code:java}
HRegionFileSystem regionFs = null;
try {
  regionFs = HRegionFileSystem.openRegionFromFileSystem(this.services.getConfiguration(), fs,
    tabledir, mergedRegion, true);
} catch (IOException e) {
  LOG.warn("Merged region does not exist: " + mergedRegion.getEncodedName());
}
if (regionFs == null || !regionFs.hasReferences(htd)) {
  // .. do the cleanup ..
}
{code}
I think the assumption here is that an IOException is only thrown when the region doesn't exist, but that does not hold. We had a very poorly timed NameNode failover during a CatalogJanitor run, after a merge. The failover caused the openRegionFromFileSystem call to fail, so regionFs stayed null and the cleanup branch ran. The janitor logged:
{code:java}
WARN org.apache.hadoop.hbase.master.janitor.CatalogJanitor: Merged region does not exist: 32c71224852c5a4b94a3ba271b4fcb15 {code}
This region did in fact exist and had not fully compacted, so there were still some lingering reference files.
The cleanup process moves the parent regions to the archive directory, but the default TTL for files in the archive directory is only 5 minutes. Once that TTL expires the files are deleted and the data is unrecoverable.
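
To illustrate the distinction that seems to be missing, here is a rough, self-contained sketch. It is not a proposed patch. RegionFs, RegionOpener and safeToCleanParents are hypothetical stand-ins, not HBase APIs, and the sketch assumes the "region is really gone" case can be told apart from other failures (modeled here as a FileNotFoundException), which the current openRegionFromFileSystem may not do. The point is only that an unrelated IOException should propagate and skip the cleanup, rather than be treated as "region does not exist".

```java
import java.io.FileNotFoundException;
import java.io.IOException;

final class MergeCleanupSketch {

  /** Hypothetical stand-in for the merged region's filesystem view. */
  interface RegionFs {
    boolean hasReferences() throws IOException;
  }

  /** Hypothetical stand-in for the call that opens the merged region. */
  interface RegionOpener {
    RegionFs open() throws IOException;
  }

  /**
   * Returns true only when it is positively known that the merged region is
   * gone or no longer holds reference files. Any other IOException (for
   * example one caused by a NameNode failover) propagates to the caller,
   * so the parents are left alone until a later run.
   */
  static boolean safeToCleanParents(RegionOpener opener) throws IOException {
    RegionFs regionFs;
    try {
      regionFs = opener.open();
    } catch (FileNotFoundException e) {
      // Assumption: a missing region surfaces as FileNotFoundException.
      return true;
    }
    return !regionFs.hasReferences();
  }

  public static void main(String[] args) throws IOException {
    // Region positively missing: parents can be cleaned.
    System.out.println(safeToCleanParents(() -> {
      throw new FileNotFoundException("region dir missing");
    }));
    // Region exists and still has reference files: parents must be kept.
    System.out.println(safeToCleanParents(() -> () -> true));
    // Transient filesystem error: neither answer is known, so skip cleanup.
    try {
      safeToCleanParents(() -> {
        throw new IOException("simulated NameNode failover");
      });
    } catch (IOException e) {
      System.out.println("skipping cleanup this run: " + e.getMessage());
    }
  }
}
```

With this shape, the failover in our incident would have aborted that janitor pass for the merged region instead of archiving parents that were still referenced.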