
Posted to dev@hbase.apache.org by "Peter Somogyi (JIRA)" <ji...@apache.org> on 2017/11/15 16:14:00 UTC

[jira] [Resolved] (HBASE-13850) Check for dead server on CallTimeoutException

     [ https://issues.apache.org/jira/browse/HBASE-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Somogyi resolved HBASE-13850.
-----------------------------------
    Resolution: Duplicate

> Check for dead server on CallTimeoutException
> ---------------------------------------------
>
>                 Key: HBASE-13850
>                 URL: https://issues.apache.org/jira/browse/HBASE-13850
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client, MTTR
>    Affects Versions: 2.0.0, 1.2.0
>            Reporter: Matteo Bertozzi
>            Assignee: huaxiang sun
>            Priority: Minor
>         Attachments: HBASE-13850-v0.patch, TestGetPerf.java
>
>
> WARNING: this may be a misconfiguration, so let me know if there is a timeout param I should set.
> {noformat}
> hbase-site.xml
> zookeeper.session.timeout 10000
> hbase.regionserver.storefile.refresh.period 10000
> hbase.client.operation.timeout 5000
> hbase.client.meta.operation.timeout 5000
> hbase.client.scanner.timeout.period 10000
> hbase.regionserver.lease.period 10000
> {noformat}
> I have a test that does a kill -STOP on an RS and then tries to query it.
> From the conf, the ZK lease is 10 sec; the master correctly does the reassign after 10 sec and meta is updated.
> The client keeps trying to query the RS for a specific row until it gets a response. The table.get(row) in the loop throws a CallTimeoutException every 5 sec (which is the configured setting), but instead of succeeding after 2/3 retries (> 10 sec, when the master reassigns), it keeps retrying for up to 60 sec (I don't know what that 60 sec is; maybe a conf param that I'm not able to find).
> One simple fix in the code is to handle the CallTimeoutException in RegionServerCallable and clear the meta cache for the RS that is not responding (but maybe there is already a conf to set to reduce that 60 sec period).
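
The timing argument in the description can be illustrated with a small, self-contained simulation. This is only a toy model under stated assumptions, not HBase code: the class and method names (StaleCacheRetrySketch, metaLookup, timeToSuccess), the server names rs1/rs2, and the simulated clock are all hypothetical. It uses the 5 sec call timeout and 10 sec reassignment delay from the configuration above, and compares a client that keeps its cached region location after a timeout with one that drops it.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names are HBase APIs.
class StaleCacheRetrySketch {
    // Values taken from the hbase-site.xml in the report.
    static final long CALL_TIMEOUT_MS = 5_000;    // hbase.client.operation.timeout
    static final long REASSIGN_AFTER_MS = 10_000; // zookeeper.session.timeout

    // Stand-in for a meta lookup: the region is on rs1 (stopped) until the
    // master reassigns it to rs2 once the ZK session has expired.
    static String metaLookup(long nowMs) {
        return nowMs >= REASSIGN_AFTER_MS ? "rs2" : "rs1";
    }

    // Retries a get against the cached location and returns the simulated
    // time (ms) at which it succeeds, or -1 if it still fails at deadlineMs.
    static long timeToSuccess(boolean clearCacheOnTimeout, long deadlineMs) {
        Map<String, String> locationCache = new HashMap<>();
        long now = 0;
        while (now < deadlineMs) {
            String server = locationCache.get("row");
            if (server == null) {
                server = metaLookup(now);          // cache miss: consult meta
                locationCache.put("row", server);
            }
            if (!server.equals("rs1")) {
                return now;                        // live server answers
            }
            now += CALL_TIMEOUT_MS;                // rs1 is stopped: call times out
            if (clearCacheOnTimeout) {
                locationCache.remove("row");       // the fix proposed in the report
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("keep cache:  " + timeToSuccess(false, 60_000));
        System.out.println("clear cache: " + timeToSuccess(true, 60_000));
    }
}
```

In this model the client that clears its cached location succeeds at the 10 sec mark, after two timeouts, while the one that keeps the stale entry is still failing at the 60 sec deadline. The model says nothing about where the 60 sec in the real client comes from.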



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)