Posted to dev@hbase.apache.org by "Yu Li (JIRA)" <ji...@apache.org> on 2018/03/08 07:15:00 UTC

[jira] [Created] (HBASE-20158) Enhance regionserver self health check to avoid stale server

Yu Li created HBASE-20158:
-----------------------------

             Summary: Enhance regionserver self health check to avoid stale server
                 Key: HBASE-20158
                 URL: https://issues.apache.org/jira/browse/HBASE-20158
             Project: HBase
          Issue Type: New Feature
            Reporter: Yu Li
            Assignee: Yu Li


Currently we have many good metrics to monitor cluster status, such as totalCallTime/processCallTime/queueCallTime. But these metrics won't help if the server goes stale and client calls time out, for example during a full GC on the RS, or when bad disks on HDFS cause read IO to get stuck.

We also have a periodic health check chore, introduced by HBASE-7351, which allows us to launch an external script periodically to perform self detection. However, this won't work if the server has run out of system resources, for example when no new native thread can be created or no new network connection can be set up. Note that although no new thread can be launched, running threads are not affected, so the zookeeper session stays alive and the RS is still regarded as alive, but clients cannot access it since no new connection can be set up.

Here we propose a new HealthChecker called DirectHealthChecker. This new checker won't launch any external script. Instead it picks some regions on the RS and sends rpc requests to the server itself, regards the server as unhealthy if the call failure ratio exceeds some limit, and sends the metrics out to our monitoring system. For more details, please refer to the coming patch.
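
To make the failure-ratio idea concrete, here is a minimal sketch in plain Java. It is not the coming patch and uses no HBase API: the class name FailureRatioHealthCheck, the probe callback and the maxFailureRatio parameter are all hypothetical, and the probe stands in for whatever rpc the checker would send to the server itself for a sampled region.

```java
import java.util.List;
import java.util.function.Predicate;

/**
 * Hypothetical sketch of a failure-ratio health check.
 * R is whatever identifies a sampled region; the probe is an assumed
 * stand-in for an rpc sent by the regionserver to itself.
 */
final class FailureRatioHealthCheck<R> {
    // Returns true when the self-directed call for a region succeeded.
    private final Predicate<R> probe;
    // Assumed limit: the server is unhealthy when the ratio exceeds this.
    private final double maxFailureRatio;

    FailureRatioHealthCheck(Predicate<R> probe, double maxFailureRatio) {
        if (maxFailureRatio < 0.0 || maxFailureRatio > 1.0) {
            throw new IllegalArgumentException("maxFailureRatio must be in [0, 1]");
        }
        this.probe = probe;
        this.maxFailureRatio = maxFailureRatio;
    }

    /** Fraction of sampled regions whose probe failed or threw. */
    double failureRatio(List<R> sampledRegions) {
        if (sampledRegions.isEmpty()) {
            return 0.0; // nothing sampled, nothing failed
        }
        int failures = 0;
        for (R region : sampledRegions) {
            boolean ok;
            try {
                ok = probe.test(region);
            } catch (RuntimeException e) {
                ok = false; // a thrown error counts as a failed call
            }
            if (!ok) {
                failures++;
            }
        }
        return (double) failures / sampledRegions.size();
    }

    /** Healthy as long as the failure ratio does not exceed the limit. */
    boolean isHealthy(List<R> sampledRegions) {
        return failureRatio(sampledRegions) <= maxFailureRatio;
    }
}
```

In a real checker the probe would need its own timeout, since a stuck call is exactly the stale-server case described above, and the computed ratio would be the value reported to the monitoring system.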



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)