Posted to issues@hbase.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2018/12/14 00:38:04 UTC

[jira] [Commented] (HBASE-18058) Zookeeper retry sleep time should have an upper limit

    [ https://issues.apache.org/jira/browse/HBASE-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720753#comment-16720753 ] 

Hudson commented on HBASE-18058:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.3-IT #509 (See [https://builds.apache.org/job/HBase-1.3-IT/509/])
HBASE-18058 Zookeeper retry sleep time should have an upper limit (Allan (apurtell: rev ed4f7d1b1b4497caf859da6119c2940fdfaba9a9)
* (edit) hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java
* (edit) hbase-client/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java
* (edit) hbase-common/src/main/resources/hbase-default.xml


> Zookeeper retry sleep time should have an upper limit
> -----------------------------------------------------
>
>                 Key: HBASE-18058
>                 URL: https://issues.apache.org/jira/browse/HBASE-18058
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.4.0, 2.0.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>             Fix For: 1.4.0, 1.3.3, 2.0.0
>
>         Attachments: HBASE-18058-branch-1.patch, HBASE-18058-branch-1.v2.patch, HBASE-18058-branch-1.v3.patch, HBASE-18058.patch, HBASE-18058.v2.patch
>
>
> Currently, in {{RecoverableZooKeeper}}, the retry backoff sleep time grows exponentially but has no upper limit. This leads directly to a very long recovery time after ZooKeeper has been down for a while and then comes back.
> An example of the damage done by a high sleep time:
> If the disk on the server hosting ZooKeeper is full, the ZooKeeper quorum does not actually go down but rejects all write requests. On the HBase side, new ZK write requests therefore fail with exceptions and are retried. The connection remains open, however, so the session does not time out. Once the disk-full situation has been resolved, the ZooKeeper quorum works normally again, but because of the very high sleep time, some modules of the RegionServer/HMaster (for example, the balancer) still sleep for a long time before resuming work.
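
The capped backoff described in the issue can be illustrated with a minimal sketch. This is not the committed patch: the class name CappedBackoff, the method sleepTimeMs, and the example values (1000 ms base, 60000 ms cap) are hypothetical and chosen only for illustration. It assumes non-negative inputs well below Long.MAX_VALUE.

```java
// Hypothetical helper, not taken from RecoverableZooKeeper or the patch.
final class CappedBackoff {

    private CappedBackoff() {
    }

    /**
     * Returns baseSleepMs doubled once per retry, but never more than maxSleepMs.
     * Doubling stops as soon as the cap is reached, so a large retry count
     * cannot push the sleep time past the upper limit.
     */
    static long sleepTimeMs(long baseSleepMs, int retryCount, long maxSleepMs) {
        long sleep = baseSleepMs;
        for (int i = 0; i < retryCount && sleep < maxSleepMs; i++) {
            sleep *= 2; // exponential growth, one doubling per retry
        }
        return Math.min(sleep, maxSleepMs); // enforce the upper limit
    }
}
```

With a 1000 ms base and a 60000 ms cap, the first retries sleep 1000, 2000, 4000 ms and so on, while every retry from the sixth doubling onward sleeps exactly 60000 ms. Without the cap, the same sequence would keep doubling, which is the long post-recovery sleep the issue describes.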



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)