Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2013/07/08 01:19:48 UTC

[jira] [Commented] (HBASE-8888) Tweak retry settings some more, *some more*

    [ https://issues.apache.org/jira/browse/HBASE-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701695#comment-13701695 ] 

stack commented on HBASE-8888:
------------------------------

Playing with this, it gets pretty silly pretty fast.  Why have long intervals between retries?  If a region is offline, it could come online at any time, so a wait of 30 or 60 seconds seems too much.  We want to be reactive, so ten seconds seems like the outer bound.  We can have HConstants#RETRY_BACKOFF ramp up fast to ten seconds.  But retrying every ten seconds until we reach five or ten minutes (the MR timeout) is an awful lot of retries, especially if the failures are being saved up to be dumped out all in one go at the end, when retries are exhausted and you get a RetriesExhaustedException.
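
To put a rough number on "an awful lot of retries", here is a minimal sketch that counts how many retries fit in a time budget when the pause is capped at ten seconds.  The class, the method names, and the multiplier table are made up for illustration (the table is not the actual HConstants#RETRY_BACKOFF contents), the 500 ms base pause is the value proposed for hbase.client.pause in the issue description, and jitter and the time spent inside each call are ignored.

```java
public class CappedBackoff {
    // Hypothetical multiplier table, NOT the real HConstants.RETRY_BACKOFF values.
    static final int[] MULTIPLIERS = {1, 2, 5, 10, 20};

    /** Pause before retry number 'attempt' (0-based), capped at maxPauseMs. */
    static long pauseMs(long basePauseMs, int attempt, long maxPauseMs) {
        int idx = Math.min(attempt, MULTIPLIERS.length - 1);
        return Math.min(basePauseMs * MULTIPLIERS[idx], maxPauseMs);
    }

    /** Number of retries whose pauses fit entirely within budgetMs. */
    static int retriesWithin(long basePauseMs, long maxPauseMs, long budgetMs) {
        long elapsed = 0;
        int attempt = 0;
        while (elapsed + pauseMs(basePauseMs, attempt, maxPauseMs) <= budgetMs) {
            elapsed += pauseMs(basePauseMs, attempt, maxPauseMs);
            attempt++;
        }
        return attempt;
    }

    public static void main(String[] args) {
        // 500 ms base pause, ten-second cap, five- and ten-minute budgets.
        System.out.println(retriesWithin(500, 10_000, 5 * 60_000L));   // prints 33
        System.out.println(retriesWithin(500, 10_000, 10 * 60_000L));  // prints 63
    }
}
```

Under these assumptions, a ten-second cap means dozens of retries before a five or ten minute limit is reached, each one potentially contributing a saved-up exception.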

Then, hitting exactly five minutes, and not five minutes and 15 seconds or four minutes and 47 seconds, is tough to do with a retry count and a retry timer function that has jitter in it.  If I were an operator trying to calculate how long our timeout is, the arbitrary timings would freak me out.

Then, on top of this, bounding how long we keep going by a retry count gets messed up if, say, down in the retry we hit a socket timeout.  At an extreme, that could mean 30 * socket timeout, with our backoff timings on top of that.
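
The worst case described above is simple arithmetic.  A sketch, where the 60-second socket timeout and the ten-second pause are assumed values picked for illustration, and the class and method names are hypothetical:

```java
public class WorstCaseWait {
    /**
     * Upper bound on total wait if every attempt blocks for the full
     * socket timeout and is then followed by a pause of pauseMs.
     */
    static long worstCaseMs(int retries, long socketTimeoutMs, long pauseMs) {
        return retries * (socketTimeoutMs + pauseMs);
    }

    public static void main(String[] args) {
        // 30 retries, assumed 60 s socket timeout, assumed 10 s pause.
        long ms = worstCaseMs(30, 60_000, 10_000);
        System.out.println(ms / 60_000 + " minutes");  // prints "35 minutes"
    }
}
```

With those numbers a retry count of 30 turns into a 35 minute wait, far from the five minutes an operator might have been expecting.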

So, a bound on how long we wait seems necessary.
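
One way such a bound could look is a retry loop that gives up when a deadline passes rather than when a counter runs out.  This is only a sketch of the idea, not what the HBase client does or what the eventual patch does; callWithDeadline and its parameters are hypothetical names, and it uses a fixed pause for brevity.

```java
import java.util.concurrent.Callable;

public class DeadlineRetry {
    /**
     * Retries op until it succeeds or maxWaitMs has elapsed.
     * Only the most recent failure is kept, rather than one per attempt.
     */
    static <T> T callWithDeadline(Callable<T> op, long pauseMs, long maxWaitMs)
            throws Exception {
        long deadline = System.nanoTime() + maxWaitMs * 1_000_000L;
        Exception last = null;
        int failures = 0;
        while (true) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                failures++;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                throw new Exception(
                    "gave up after " + failures + " failed attempts", last);
            }
            // Never sleep past the deadline.
            Thread.sleep(Math.min(pauseMs, remainingMs));
        }
    }
}
```

The pause is clipped to the time remaining, so the total wait is governed by maxWaitMs plus however long the final call itself takes.  That last term is why a per-call timeout would still need to be derived from the remaining time to make the bound tight.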
                
> Tweak retry settings some more, *some more*
> -------------------------------------------
>
>                 Key: HBASE-8888
>                 URL: https://issues.apache.org/jira/browse/HBASE-8888
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>             Fix For: 0.95.2
>
>
> Follow on from hbase-8776.
> Need to fix retries and timeouts.  We cut them down so much hbase-it tests fail.
> From https://issues.apache.org/jira/browse/HBASE-8776?focusedCommentId=13698762&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13698762 @nkeywal says:
> {code}
> I would like to change
> hbase.client.retries.number -> 30 (instead of 14 or 20 today)
> hbase.client.pause -> 500 (instead of 100 or 1000 today).
> Context: see HBASE-6295.
> As well, would it make sense to remove all the hbase-site.xml and hbase-defaults.xml to rely only on the defaults in the code? This would trigger another set of issues, as sometimes the defaults are duplicated and different. But these are bugs as well. Imho, this duplication is confusing and it leads to unreliable behavior as we don't really know which settings are actually used.
> {code}
> Regarding removing hbase-site.xml from everywhere to rely on defaults in code: over in hbase-8776 I tried removing them and way too many tests failed.  Looks like it'd be tough to remove them.
