Posted to issues@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2023/01/19 13:27:00 UTC

[jira] [Assigned] (HBASE-27579) CatalogJanitor can cause data loss due to errors during cleanMergeRegion

     [ https://issues.apache.org/jira/browse/HBASE-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault reassigned HBASE-27579:
-----------------------------------------

    Assignee: Bryan Beaudreault

> CatalogJanitor can cause data loss due to errors during cleanMergeRegion
> ------------------------------------------------------------------------
>
>                 Key: HBASE-27579
>                 URL: https://issues.apache.org/jira/browse/HBASE-27579
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Blocker
>             Fix For: 2.4.16, 2.5.3
>
>
> In CatalogJanitor.cleanMergeRegion, there is the following check:
> {code:java}
> HRegionFileSystem regionFs = null;
> try {
>   regionFs = HRegionFileSystem.openRegionFromFileSystem(this.services.getConfiguration(), fs,
>     tabledir, mergedRegion, true);
> } catch (IOException e) {
>   LOG.warn("Merged region does not exist: " + mergedRegion.getEncodedName());
> }
> if (regionFs == null || !regionFs.hasReferences(htd)) {
>   // ... do the cleanup ...
> } {code}
>  
> I think the assumption here is that an IOException is only thrown when the region doesn't exist. We had a very poorly timed NameNode failover during a CatalogJanitor run, after a merge. The failover caused the openRegionFromFileSystem call to fail with an IOException, so regionFs stayed null, the cleanup branch ran, and the janitor logged:
> {code:java}
> WARN org.apache.hadoop.hbase.master.janitor.CatalogJanitor: Merged region does not exist: 32c71224852c5a4b94a3ba271b4fcb15 {code}
> This region did in fact exist and had not fully compacted, so there were still some lingering reference files.
> The cleanup process moves the parent regions to the archive directory, but the default TTL for files in the archive directory is only 5 minutes. After that they are deleted and the data is unrecoverable.
> This resulted in FileNotFoundExceptions when trying to read or open this region. Our only course of action was to move the lingering reference files aside; the data they pointed to could not be recovered.
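
The failure mode described above can be illustrated with a small, self-contained sketch. It is not the HBase code and not necessarily how this issue was fixed; RegionProbe, RegionState, Decision and decide are hypothetical names invented for the example. The idea is only that an IOException while inspecting the merged region means "state unknown", so the safe choice is to skip cleanup and let a later janitor run retry, instead of treating the error as "region does not exist".

```java
import java.io.IOException;

class MergeCleanupSketch {
  /** What we learned about the merged region (hypothetical, for illustration only). */
  enum RegionState { MISSING, HAS_REFERENCES, NO_REFERENCES }

  /** What the janitor should do with the parent regions. */
  enum Decision { CLEAN_PARENTS, SKIP_AND_RETRY }

  /** Hypothetical stand-in for opening the merged region and checking for references. */
  @FunctionalInterface
  interface RegionProbe {
    RegionState probe() throws IOException;
  }

  static Decision decide(RegionProbe probe) {
    RegionState state;
    try {
      state = probe.probe();
    } catch (IOException e) {
      // A filesystem error (for example during a NameNode failover) tells us
      // nothing about whether the region exists, so do not archive anything.
      return Decision.SKIP_AND_RETRY;
    }
    // Clean only when the region is positively known to be gone or reference-free.
    return state == RegionState.HAS_REFERENCES
      ? Decision.SKIP_AND_RETRY
      : Decision.CLEAN_PARENTS;
  }

  public static void main(String[] args) {
    System.out.println(decide(() -> RegionState.NO_REFERENCES));
    System.out.println(decide(() -> RegionState.HAS_REFERENCES));
    System.out.println(decide(() -> {
      throw new IOException("simulated NameNode failover");
    }));
  }
}
```

With this shape, the scenario from the report (a transient error while lingering reference files still exist) ends in SKIP_AND_RETRY, so the parent regions are never moved to the archive.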



--
This message was sent by Atlassian Jira
(v8.20.10#820010)