Posted to issues@hbase.apache.org by "nkeywal (JIRA)" <ji...@apache.org> on 2012/11/21 11:35:58 UTC

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501838#comment-13501838 ] 

nkeywal commented on HBASE-5843:
--------------------------------

New scenario: a datanode issue during a WAL write.

Scenario: with a replication factor of 2, start 2 DNs and 1 RS, then do a first put. Start a new DN and unplug the second one. Do another put and measure the time of this second put.
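
For reference, a minimal sketch of the measurement with the HBase client API, assuming a pre-created table 't' with a family 'f' (the table/family/row names are illustrative, and starting/unplugging the datanodes happens outside the code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalRecoveryTiming {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // autoflush is on by default, so each put() is sent (and written to the WAL) immediately
    HTable table = new HTable(conf, "t");

    // First put: the WAL pipeline is built against the two datanodes that are up.
    table.put(new Put(Bytes.toBytes("row1"))
        .add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v1")));

    // Manual step, outside the code: start the new DN, then unplug the second one.
    System.out.println("start the new DN, unplug the second one, then press enter");
    System.in.read();

    // Second put: the WAL write has to recover from the dead datanode first.
    long start = System.currentTimeMillis();
    table.put(new Put(Bytes.toBytes("row2"))
        .add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v2")));
    System.out.println("second put: " + (System.currentTimeMillis() - start) + " ms");

    table.close();
  }
}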

HBase trunk / HDFS 1.1: ~5 minutes
HBase trunk / HDFS 2 branch: ~40 seconds
HBase trunk / HDFS 2.0.2-alpha-rc3: ~40 seconds


The time with HDFS 1.1 is spent as follows:
~66 seconds: wait for connection timeout (SocketTimeoutException: 66000 millis while waiting for the channel to be ready for read).
then, we have two nested retry loops:
- 6 retries: Failed recovery attempt #0 from primary datanode x.y.z.w:11011 -> NoRouteToHostException
- 10 sub-retries: Retrying connect to server: deadbox/x.y.z.w:11021. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

There are roughly 4 seconds between two sub-retries, so the total time is around:
66 + 6 * (~4 * 10) = ~300 seconds. That's our 5 minutes.
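
Expressed as pseudo-code, the structure behind that arithmetic looks roughly like this (an illustration of the timing observed in the logs, not the actual DFSClient code):

// Illustration only: 6 recovery attempts, each wrapping up to 10 IPC connect retries.
for (int recoveryAttempt = 0; recoveryAttempt < 6; recoveryAttempt++) {   // "Failed recovery attempt #N"
  for (int connectRetry = 0; connectRetry < 10; connectRetry++) {         // "Already tried N time(s)"
    // each sub-retry: a failing connect attempt plus the 1s fixed sleep
    // of the retry policy -> roughly 4 seconds
  }
}
// 66s initial socket timeout + 6 * 10 * ~4s = ~300s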

If we change the HDFS code to use "RetryPolicies.TRY_ONCE_THEN_FAIL" instead of the default "RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)", the put succeeds in ~80 seconds.
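
For reference, both policies come from Hadoop's org.apache.hadoop.io.retry package; a minimal sketch of the two (where exactly the DFSClient wires the policy into the block-recovery RPC proxy is not shown here):

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RetryPolicySketch {
  public static void main(String[] args) {
    // Default policy seen in the logs above: up to 10 connect attempts, 1 second apart.
    RetryPolicy defaultPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(10, 1, TimeUnit.SECONDS);

    // The experiment: give up after a single attempt instead of retrying.
    RetryPolicy experiment = RetryPolicies.TRY_ONCE_THEN_FAIL;

    System.out.println(defaultPolicy + " vs " + experiment);
  }
}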

Conclusion:
- the time with HDFS 2.x is in line with what we have for the other scenarios (~40s), so it's acceptable today.
- the time with HDFS 1.x is much less satisfying (~5 minutes!), but it could easily be decreased to ~80s with an HDFS modification.

Some points to think about:
- Maybe we could decrease the timeouts for the WAL: we usually write much less data than for a memstore flush, so having more aggressive settings for the WAL makes sense. There is a (bad) side effect: we may have more false positives, which could decrease performance and will increase the workload when the cluster is globally unstable. So it makes sense in the long term, but maybe today is too early (see the sketch after this list).
- While the Namenode will consider the datanode as stale after 30s, we still keep trying. Again, it makes sense to lower the global workload, but it's a little bit annoying... There could be optimizations if the datanode state were shared with the DFSClients.
- There are some cases that could be handled faster: ConnectionRefused means the box is there but the port is not open, so there is no need to retry. NoRouteToHostException could also be considered critical enough to stop retrying. Here again, this is trading global workload against reactivity.
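
To make the first point concrete, a very rough sketch of the idea. "dfs.client.socket-timeout" is the HDFS 2.x client key ("dfs.socket.timeout" in 1.x), the 10s value is an arbitrary example rather than a recommendation, and HBase has no WAL-specific knob for this today:

import org.apache.hadoop.conf.Configuration;

// Illustrative only: clone the region server configuration ("conf", assumed to exist)
// and give the WAL its own, more aggressive DFS client socket timeout, while
// flushes/compactions keep the default one.
Configuration walConf = new Configuration(conf);
walConf.setInt("dfs.client.socket-timeout", 10 * 1000);  // HDFS 2.x key; 1.x uses "dfs.socket.timeout"
// The FileSystem / output stream used for the WAL would then be created with walConf.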

                
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>
> A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by adding a delay to query execution, whatever the failure.
> - this delay is always under 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as the stop/start of a cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira