Posted to issues@hbase.apache.org by "Rushabh Shah (Jira)" <ji...@apache.org> on 2021/11/02 15:28:00 UTC

[jira] [Comment Edited] (HBASE-26408) Aborting to preserve WAL as source of truth can abort in recoverable situations

    [ https://issues.apache.org/jira/browse/HBASE-26408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437423#comment-17437423 ] 

Rushabh Shah edited comment on HBASE-26408 at 11/2/21, 3:27 PM:
----------------------------------------------------------------

>  I wonder if we should wrap the postWALWrite in a try/catch so that we can have more control over how exceptions from that call get wrapped.
 I was thinking along similar lines. We can still throw DamagedWALException from the append method, but the underlying cause should be one of several distinct checked exceptions, so that we can make better decisions about which ones should trigger an abort and which should not. Today we would have to compare exception message strings, which is very brittle.
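
To make the idea concrete, here is a minimal sketch of deciding by exception type instead of by message text. Everything in it is an assumption for illustration: RecoverableWALException and shouldAbort are hypothetical names, and the nested DamagedWALException is only a stand-in so the snippet compiles on its own. It is not the actual HBase class or the change that was eventually made.

```java
import java.io.IOException;

public class WalFailureClassifier {

  /** Hypothetical checked exception marking a failure that does not require an abort. */
  public static class RecoverableWALException extends IOException {
    public RecoverableWALException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Stand-in for HBase's DamagedWALException, defined here only to keep the sketch self-contained. */
  public static class DamagedWALException extends IOException {
    public DamagedWALException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Upper bound on how far we follow getCause(), as a guard against cyclic cause chains.
  private static final int MAX_CAUSE_DEPTH = 32;

  /**
   * Returns false only when a RecoverableWALException appears somewhere in the
   * cause chain; every other failure (including null) defaults to abort.
   */
  public static boolean shouldAbort(Throwable t) {
    Throwable current = t;
    for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
      if (current instanceof RecoverableWALException) {
        return false;
      }
      current = current.getCause();
    }
    return true;
  }
}
```

With this shape the catch block only needs a type check on the cause chain, and anything unrecognized still errs on the side of aborting.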



> Aborting to preserve WAL as source of truth can abort in recoverable situations
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-26408
>                 URL: https://issues.apache.org/jira/browse/HBASE-26408
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> HBASE-26195 added an important feature to avoid data corruption by preserving the WAL as a source of truth when WAL sync fails. See that issue for background.
> That issue's primary driver was a TimeoutIOException, but the solution was to catch and abort on Throwable. The idea here was that we can't anticipate all possible failures, so we should err on the side of data correctness. As pointed out by [~rushabh.shah] in his comments, this solution has the potential to lose HBase capacity quickly in "not very grave" situations. It would be good to add an escape hatch for explicit known cases, one of which I encountered myself:
> I recently rolled this out to some of our test clusters, most of which are small. Afterward, doing a rolling restart of DataNodes caused the following IOException: "Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try..."
> If you're familiar with HDFS pipeline recovery, this error will be familiar. Basically, the restarted DataNodes caused pipeline failures, those DataNodes were added to an internal exclude list that never gets cleared, and eventually there were no more nodes to choose from, which resulted in the error.
> This error is pretty explicit, and at this point the DFSOutputStream for the WAL is dead. I think this error is a reasonable one to simply bubble up and not abort the RegionServer on, instead just failing and rolling back the writes.
> What do people think about starting an allowlist of known good error messages for which we do not trigger an abort of the RS? Something like this:
> {{} catch (Throwable t) {}}
> {{  // WAL sync failed. Aborting to avoid a mismatch between the memstore, WAL,}}
> {{  // and any replicated clusters.}}
> {{  if (!walSyncSuccess && !allowedException(t)) {}}
> {{    rsServices.abort("WAL sync failed, aborting to preserve WAL as source of truth", t);}}
> {{  }}}
> {{... snip ..}}
> {{private boolean allowedException(Throwable t) {}}
> {{  // getMessage() may return null, so guard before matching the prefix.}}
> {{  String msg = t.getMessage();}}
> {{  return msg != null && msg.startsWith("Failed to replace a bad datanode");}}
> {{}}}
> We could of course make this configurable if people like, or just add to it over time.
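
As a rough illustration of the configurable variant mentioned in the description, the sketch below builds the allowlist from a comma-separated string of message prefixes. The class name AbortAllowlist and the comma-separated format are assumptions made for this example only; no such HBase setting is implied, and a prefix that itself contains a comma would need a different encoding.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AbortAllowlist {

  // Message prefixes for which a failed WAL sync should not abort the RegionServer.
  private final List<String> prefixes;

  /** Parses a comma-separated list of prefixes; null or blank input yields an empty allowlist. */
  public AbortAllowlist(String commaSeparatedPrefixes) {
    List<String> parsed = new ArrayList<>();
    if (commaSeparatedPrefixes != null) {
      for (String part : commaSeparatedPrefixes.split(",")) {
        String trimmed = part.trim();
        if (!trimmed.isEmpty()) {
          parsed.add(trimmed);
        }
      }
    }
    this.prefixes = Collections.unmodifiableList(parsed);
  }

  /** Returns true if the throwable's message starts with any configured prefix. */
  public boolean allowedException(Throwable t) {
    if (t == null || t.getMessage() == null) {
      // No message to match against, so fall back to the abort path.
      return false;
    }
    String message = t.getMessage();
    for (String prefix : prefixes) {
      if (message.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }
}
```

An empty allowlist keeps the abort-on-everything behavior, so operators would opt in to each exemption explicitly.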


