Posted to issues@hbase.apache.org by "Yu Li (JIRA)" <ji...@apache.org> on 2018/03/08 06:49:00 UTC

[jira] [Created] (HBASE-20156) Allow regionserver to live during HDFS failure

Yu Li created HBASE-20156:
-----------------------------

             Summary: Allow regionserver to live during HDFS failure
                 Key: HBASE-20156
                 URL: https://issues.apache.org/jira/browse/HBASE-20156
             Project: HBase
          Issue Type: New Feature
            Reporter: Yu Li


Currently, if something goes wrong with HDFS, for example NN fencing or the NN entering safe mode, the RS aborts itself immediately after detecting it (for example, when a log roll or flush fails). On a large cluster with a dense write workload, there is then a huge amount of WAL to split and replay once HDFS is back, and recovery can take tens of minutes or even hours. We have experienced this more than once in production; there are always surprises, such as an unstable power supply for the NN, that we never expected.

Here we propose adding an option that lets the RS keep running during an HDFS failure instead of aborting. While HDFS is down, the RS will throw exceptions to clients indicating that it is out of service, and it can recover right after HDFS is back.

This will also make it possible to restart HDFS in some extreme cases, and allow us to survive if anything goes wrong during an HDFS upgrade.
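
To make the proposed behavior concrete, here is a minimal, language-neutral sketch in Python. It is not HBase code: the class `RegionServerSketch`, the flag `survive_storage_failure`, the error `OutOfServiceError`, and the `storage_is_up` probe are all hypothetical names chosen for illustration, and the real option name and recovery mechanism are not specified in this issue.

```python
import enum


class StorageState(enum.Enum):
    HEALTHY = "healthy"
    UNAVAILABLE = "unavailable"


class OutOfServiceError(Exception):
    """Hypothetical error sent to clients while storage is unavailable."""


class RegionServerSketch:
    """Toy model of the proposal; not HBase code."""

    def __init__(self, survive_storage_failure, storage_is_up):
        # survive_storage_failure: stand-in for the proposed option (assumed name).
        # storage_is_up: callable returning True once HDFS is usable again.
        self.survive_storage_failure = survive_storage_failure
        self.storage_is_up = storage_is_up
        self.state = StorageState.HEALTHY
        self.aborted = False
        self.memstore = {}

    def on_storage_failure(self):
        """Invoked when, for example, a log roll or flush fails."""
        if self.survive_storage_failure:
            # Proposed behavior: stay alive but stop serving requests.
            self.state = StorageState.UNAVAILABLE
        else:
            # Current behavior: abort immediately.
            self.aborted = True

    def check_storage(self):
        """Periodic probe; resume service once storage is back."""
        if self.state is StorageState.UNAVAILABLE and self.storage_is_up():
            self.state = StorageState.HEALTHY

    def put(self, row, value):
        """Accept a write, or reject it while out of service."""
        if self.aborted:
            raise RuntimeError("region server has aborted")
        if self.state is StorageState.UNAVAILABLE:
            raise OutOfServiceError("storage unavailable, retry later")
        self.memstore[row] = value
```

In this sketch, a server created with `survive_storage_failure=True` rejects `put` calls with `OutOfServiceError` after `on_storage_failure()` and accepts them again once `check_storage()` sees the probe succeed, so no WAL split and replay is needed on recovery. With the flag set to `False`, it models the current abort behavior.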



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)