Posted to issues@hbase.apache.org by "nkeywal (Created) (JIRA)" <ji...@apache.org> on 2012/04/20 12:33:43 UTC

[jira] [Created] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Improve HBase MTTR - Mean Time To Recover
-----------------------------------------

                 Key: HBASE-5843
                 URL: https://issues.apache.org/jira/browse/HBASE-5843
             Project: HBase
          Issue Type: Umbrella
    Affects Versions: 0.96.0
            Reporter: nkeywal
            Assignee: nkeywal


A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit

The ideal target is:
- failures impact client applications only by adding a delay to query execution, whatever the failure.
- this delay is always under 1 second.


We're not going to achieve that immediately...

Priority will be given to the most frequent issues.
Short term:
- software crash
- standard administrative tasks such as stopping/starting a cluster.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "Gregory Chanan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459356#comment-13459356 ] 

Gregory Chanan commented on HBASE-5843:
---------------------------------------

bq. Was reading today that VoltDB does offheap so can have small java heap and just run w/ defaults other than setting -Xmx and -Xms. Could try a few RS per box.. w/ small heap each (RS needs to be put on a CPU diet but could look at that afterward).

Yes, CPU definitely needs a diet.  Probably start by eliminating a bunch of threads.

bq. Elsewhere nkeywal puts up a suggested prescription where on crash we assign and then split logs (rather than other way round as we do now). Nicolas suggests that the assigned regions could immediately start taking writes; we'd throw an exception if a read attempt was made until after the split completed and the regions edits had been replayed: HBASE-6752

Right, I think HBASE-6752 is a great idea, but it doesn't address serving reads more quickly.  I'm wondering if there is more we can do to address that.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427182#comment-13427182 ] 

nkeywal commented on HBASE-5843:
--------------------------------

Same tests as before, with the datanodes.

1) Clean stop of one RS; wait for all regions to become online again:
Pseudo distributed without datanode:
0.92: ~800 seconds
0.96: ~13 seconds

With two datanodes, on hadoop 1.0
0.92: ~460 seconds
0.96: ~12 seconds

3) Start of the cluster after a clean stop; wait for all regions to become online.
Pseudo distributed without datanode:
0.92: ~1020s
0.94: ~1023s (tested once only)
0.96: ~31s

With two datanodes, on hadoop 1.0
0.92: ~640 seconds
0.96: ~35 seconds

So it seems 0.92 is faster with the DN, but we still see a major improvement.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397559#comment-13397559 ] 

nkeywal commented on HBASE-5843:
--------------------------------

Small status as of June.

* Improvements identified
    Failure detection time: performed by ZK, with a timeout. With the default value, we needed 90 seconds before starting to act on a software or hardware issue.
    Recovery time - server side: split in two parts: reassigning the regions of a dead RS to a new RS, and replaying the WAL. Must be as fast as possible.
    Recovery time - client side: errors should be transparent for the user code. On the client side, we must also limit the time lost on errors to a minimum.
    Planned rolling restart: just make this as fast and as non-disruptive as possible.
    Other possible changes: detailed below.

* Status
    Failure detection time: software crash - done
    Done in HBASE-5844, HBASE-5926

Failure detection time: hardware issue - not started
1) As much as possible, it should be handled by ZooKeeper and not HBase; see open JIRAs such as ZOOKEEPER-702, ZOOKEEPER-922, ...
2) We need to make it easy for a monitoring tool to tag a RS or Master as dead. This way, specialized HW tools could point out dead RSs. JIRA to open.

Recovery time - Server: in progress
1) bulk assignment: to be retested; there are many recently closed JIRAs on this (HBASE-5998, HBASE-6109, HBASE-5970, ...). A lot of work by many people. There are still possible improvements (HBASE-6058, ...)
2) Log replay: to be retested; there are many recently closed JIRAs on this (HBASE-6134, ...).

Recovery time - Client - done
1) The RS now returns the new RS to the client after a region move (HBASE-5992, HBASE-5877)
2) The client retries sooner on errors (HBASE-5924).
3) In the future, it could be interesting to share the region locations in ZK with the client. It's not reasonable today, as it could lead to too many connections to ZK. ZOOKEEPER-1147 is an open JIRA on this.

Planned rolling restart performance - in progress
This benefits from the client modifications mentioned above.
To do: analyze move performance to make it faster if possible.

Other possible changes
    Restart the server immediately on software crash: done in HBASE-5939
    Reuse the same assignment on software crash: not planned
    Use spare hardware to reuse the same assignment on hardware failure: not planned
    Multiple RS for the same region (excluded in the initial document: hbase architecture change previously discussed by Gary H./Andy P.): not planned


                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398535#comment-13398535 ] 

nkeywal commented on HBASE-5843:
--------------------------------

@ram Thanks for pointing this one out. I will wait for the fix before redoing a perf check.
@andrew Yes, there should be no issue. HBASE-5926 modifies the content of HBASE-5844, so the merged patch will be smaller. I will have a look.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451100#comment-13451100 ] 

nkeywal commented on HBASE-5843:
--------------------------------

Improve HBase MTTR? Unsexy? Really?
More seriously: agreed. Will do.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409666#comment-13409666 ] 

nkeywal commented on HBASE-5843:
--------------------------------

@andrew I had a look at HBASE-5844 and HBASE-5926; they have a small dependency on protobuf stuff that I had forgotten (they read the server name from ZK), so it's not a pure git port.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397674#comment-13397674 ] 

ramkrishna.s.vasudevan commented on HBASE-5843:
-----------------------------------------------

@N
Just as an update: I feel HBASE-6060 (in progress) may also be related to MTTR.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459313#comment-13459313 ] 

stack commented on HBASE-5843:
------------------------------

bq. ...multiple RS on a single box?

Was reading today that VoltDB does offheap so can have small java heap and just run w/ defaults other than setting -Xmx and -Xms.  Could try a few RS per box.. w/ small heap each (RS needs to be put on a CPU diet but could look at that afterward).

bq. ... An improvement would be to reassign the regions in parallel with the split

Elsewhere nkeywal puts up a suggested prescription where on crash we assign and then split logs (rather than other way round as we do now).  Nicolas suggests that the assigned regions could immediately start taking writes; we'd throw an exception if a read attempt was made until after the split completed and the regions edits had been replayed: HBASE-6752
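
For illustration, a minimal sketch of that read-gating idea; the class and method names below are made up for the example, not actual HBase code:

{code}
import java.io.IOException;

// Hypothetical sketch of the HBASE-6752 idea (illustrative names only):
// a region reassigned before log splitting takes writes immediately,
// but rejects reads until the WAL split has completed and the region's
// edits have been replayed.
class RecoveringRegion {
  private volatile boolean editsReplayed = false;

  void put(byte[] row, byte[] value) {
    applyToMemstore(row, value); // writes are accepted during recovery
  }

  byte[] get(byte[] row) throws IOException {
    if (!editsReplayed) {
      // a read here could miss edits still sitting in the dead server's WAL
      throw new IOException("Region recovery in progress; reads not served yet");
    }
    return readFromStore(row);
  }

  void onWalReplayComplete() {
    editsReplayed = true; // from now on, reads are safe
  }

  private void applyToMemstore(byte[] row, byte[] value) { /* elided */ }

  private byte[] readFromStore(byte[] row) { return null; /* elided */ }
}
{code}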
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397876#comment-13397876 ] 

Andrew Purtell commented on HBASE-5843:
---------------------------------------

HBASE-5844 and HBASE-5926 are not in 0.94; can we port these back? They seem like pretty self-contained changes.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "Gregory Chanan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476563#comment-13476563 ] 

Gregory Chanan commented on HBASE-5843:
---------------------------------------

One other question, could you explain these numbers in more detail?

{quote}
The split is 10s per 60Gb, on a single and slow HD. With a reasonable cluster, this should scale pretty well. We could improve things by using locality.
{quote}
Should this be 10s per 60 Mb, which is the size of a log ("creates 8 log files of 60 Mb each")?

{quote}
Insert 1M or 10M rows, distributed on all regions. That creates 8 log files of 60Mb each, on a single server.
{quote}
The number and size of log files are independent of the number of rows?  Is the size of a row in the 10M setup 1/10 the size of a row in the 1M setup?

{quote}
We can expect, in production, from a server-side point of view:
30s detection time for a hw failure, 0s for simpler cases (kill -9, OOM, machine nicely rebooted, ...)
10s split (i.e. distributed along multiple region servers)
10s assignment (i.e. distributed as well).
{quote}

For detection: In all the tests where we are not deleting the znode, either because we are on 0.94 or because it is a hw failure, the detection time is 180s.  Why is this listed as 30 sec?
For split: I think I understand this.  8 log files, 60 Mb each means 8 split tasks on 4 regionservers = 2 split tasks per RS.  This took ~25 seconds, so if we had enough RS for each RS to only do 1 split task, we'd be done in about 10s.
For assignment: Not sure where this number is coming from.  I see ~30s assignment and we know this would go faster with more RS, but how fast?

                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450625#comment-13450625 ] 

nkeywal commented on HBASE-5843:
--------------------------------

Some tests and analysis around Distributed Split

Scenario with a single machine, killing only a region server:
- dfs.replication = 2
- local HD. The test failed with the ram drive.
- Start with 2 DN and 2 RS. Create a table with 100 regions in the second one; the first holds meta & root (see the sketch after this list for the table creation).
- Insert 10M rows, distributed on all regions. That creates 8 log files of 60Mb each.
- Start 3 more DN & RS.
- kill -9 the second RS (NOT the datanode).
- Wait for the regions to be available again.
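
(For reference, a table with a fixed number of regions like this is created pre-split. A minimal sketch with the client API; the table name, family and key format are illustrative:)

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: create a table pre-split into 100 regions, as in the scenario.
// Table name, family and key format are illustrative.
public class CreateMttrTestTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("mttr_test");
    desc.addFamily(new HColumnDescriptor("f"));
    byte[][] splits = new byte[99][]; // 99 split keys => 100 regions
    for (int i = 0; i < 99; i++) {
      splits[i] = Bytes.toBytes(String.format("%02d", i));
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}
{code}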

0.94
~180s detection time (sometimes less, which is strange, but always more than 120s).
~25s split (two tasks of ~10s each per regionserver)
~30s assignment

0.96
~0s detection time (because it's a kill -9, so we have HBASE-5844. A hw failure would bring the same result as 0.94).
~25s split (two tasks of ~10s each)
~30s assignment. Other tests seem to show that the actual time is taken by replaying the edits.

=> No difference in time here, except the detection time, which will not play a role in a HW failure.
=> A split task is done in 10s. If you have a reasonable cluster, you can expect this task to take 10s in production.
=> The same should apply to replaying edits, if the regions are well redistributed among the other machines.
=> Still, the assignment/replaying could be faster. Analysis to do, and a JIRA to create.

As of today, if the regionserver crashes, we should have, in production:
- 0.94: ~50s (30s detection + 10s split + 10s assignment)
- 0.96: ~20s (0s detection + 10s split + 10s assignment)

If it's a HW failure, we need to take into account that we've lost a datanode as well.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450718#comment-13450718 ] 

nkeywal commented on HBASE-5843:
--------------------------------

Some tests and analysis around Distributed Split / Datanodes failures

On a real cluster, 3 nodes.
- dfs.replication = 2
- local HD. The test failed with the ram drive.
- Start with 2 DN and 2 RS. Create a table with 100 regions in the second one. The first holds meta & root.
- Insert 1M or 10M rows, distributed on all regions. That creates 8 log files of 60Mb each, on a single server.
- Start another box with a DN and a RS. This box is empty (no regions, no blocks).
- Unplug (physically) the box with the 100 regions and the 1 (for 1M puts) or 8 (for 10M puts) log files.

Durations are in seconds, with HDFS 1.0.3 unless stated otherwise.

1M puts on 0.94:
~180s detection time, sometimes around 150s
~130s split time (there is a single file to split. This is to be compared to the 10s per split above)
~180s assignment, including replaying edits. There could be some locks, as we're reassigning/replaying 50 regions per server.

1M puts on 0.96 (3 tests, one failure):
~180s detection time, sometimes around 150s
~180s split time. Once again a single file to split; it's unclear why it takes longer than on 0.94.
~180s assignment, as with 0.94.

Out of 3 tests, it failed once on 0.96. It didn't fail on 0.94.

10M puts on 0.96 + HDFS branch 2 as of today
~180s detection time, sometimes around 150s
~11 minutes split. Basically it fails until the HDFS namenode marks the datanode as dead; that takes 7:30 minutes, so the split finishes after this.
~60s assignment? Tested only once.

0M (zero) puts on 0.96 + HDFS branch 2 as of today
~180s detection time, sometimes around 150s
~0s split. 
~3s assignment (This seems to say that the assignment time is spent in the edit replay.)

10M puts on 0.96 + HDFS branch 2 + HDFS-3703 full (read + write paths)
~180s detection time, sometimes around 150s
~150s split. This is for a bad reason: all tasks except one succeed. The last one seems to connect to the dead server, and finishes after ~100s. Tested twice.
~50s assignment. Measured once.

So:
- The measures on assignment are fishy, but they seem to say that we are now spending our time replaying edits. We could have issues linked to HDFS as well here: in the last two scenarios we're not going to the dead nodes when we replay/flush edits, so that could be the reason.
- The split is 10s per 60Gb, on a single and slow HD. With a reasonable cluster, this should scale pretty well. We could improve things by using locality.
- There will be datanode errors if you don't have HDFS-3703. And in this case it becomes complicated. HBASE-6738.
- With HDFS-3703, we're 500s faster. That's interesting.
- Even with HDFS-3703 there is still something to look at in how HDFS connects to the dead node. It seems the block is empty, so it is retried multiple times. There are multiple possible paths here.
- We can expect, in production, from a server-side point of view:
   - 30s detection time for a hw failure, 0s for simpler cases (kill -9, OOM, machine nicely rebooted, ...)
   - 10s split (i.e. distributed along multiple region servers)
   - 10s assignment (i.e. distributed as well).
- This is without HDFS effects; see above.
- This scenario is extreme, as we're losing 50% of our data. Still, if you're losing a regionserver with 300 regions, the split may not go well if you're unlucky.
- It means as well that the detection time dominates the other parameters when everything goes well.

Conclusion:
- The HDFS/HBase link plays the critical role in this scenario. HDFS-3703 is one of the keys.
- The distributed split seems to work well in terms of performance.
- Assignment itself seems ok. Replaying should be looked at (more in terms of locking than raw performance).
- Detection time will become more and more important.
- An improvement would be to reassign the regions in parallel with the split, with:
   - continuing to serve writes before the end of the split: the fact that we're splitting the logs does not mean we cannot write. There are real applications that could use this (maybe OpenTSDB, for example: any application that logs data just needs to know where to write).
   - continuing to serve reads if they are time-ranged with the max timestamp before the failure: there are many applications that don't need fresh data (i.e. less than 1 minute old). See the sketch after this list.
- With this, the downtime would be totally dominated by the detection time.
- There are JIRAs around the detection time already (basically: improve ZK and open HBase to external monitoring systems).
- There will be some work around the client part.
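
As a sketch of the time-ranged reads mentioned in the list above: the standard client API can already express such a read; the table name, row and timestamp argument are illustrative.

{code}
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a read restricted to a time range that ends before the failure.
// In the scheme above, such a read could be served before the WAL replay
// finishes, since no edit newer than maxStamp is needed to answer it.
public class TimeRangedRead {
  public static void main(String[] args) throws IOException {
    long failureTimestamp = Long.parseLong(args[0]); // illustrative input
    HTable table = new HTable(HBaseConfiguration.create(), "mttr_test");
    Get get = new Get(Bytes.toBytes("row-42"));
    get.setTimeRange(0, failureTimestamp); // [minStamp, maxStamp): older cells only
    Result result = table.get(get);
    System.out.println("value: " + Bytes.toStringBinary(result.value()));
    table.close();
  }
}
{code}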
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459262#comment-13459262 ] 

Ted Yu commented on HBASE-5843:
-------------------------------

bq. multiple RS on a single box?
In case that box goes down, recovery would be more costly (compared to a single region server running on the box), right?
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "Gregory Chanan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459278#comment-13459278 ] 

Gregory Chanan commented on HBASE-5843:
---------------------------------------

I don't know.  You'd have to split the logs and replay them, which should give similar performance whether there are 1 or 2 WALs?
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476582#comment-13476582 ] 

nkeywal commented on HBASE-5843:
--------------------------------

bq. Should this be 10s per 60 Mb, which is the size of a log ("creates 8 log files of 60 Mb each")?
Yes :-).

bq. The number and size of log files are independent of the number of rows? Is the size of a row in the 10M setup 1/10 the size of a row in the 1M setup?
No / ~Yes. There's a maximum number of log files before a flush, however.

bq.  on 0.94 or hw failure, the detection time is 180s. Why is this listed as 30 sec?
It's discussed very briefly in HBASE-5844: it was bumped recently from 60s to 180s to help newcomers. The ZK default is 30s, and it is imho the best choice for someone requiring a reasonable MTTR without too much risk.
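
(For reference, the property in question is zookeeper.session.timeout. A minimal sketch of setting it back to 30s; it would normally go in hbase-site.xml, the programmatic form is just for brevity:)

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: lower the ZK session timeout from the shipped 180s default
// back to the 30s discussed above.
public class AggressiveDetectionTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("zookeeper.session.timeout", 30000); // milliseconds
    System.out.println("timeout = " + conf.get("zookeeper.session.timeout"));
  }
}
{code}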

bq. For split: I think I understand this. 8 log files, 60 Mb each means 8 split tasks on 4 regionservers = 2 split tasks per RS. This took ~25 seconds, so if we had enough RS for each RS to only do 1 split task, we'd be done in about 10s.
Yes, exactly. It's a reasonable expectation imho, even if there will be other cases.

bq. For assignment: Not sure where this number is coming from. I see ~30s assignment and we know this would go faster with more RS, but how fast?
The calculations seem to say that the time is spent in the replay (those are the initial results from 20/Jun/12 16:52 and the test with zero rows from 07/Sep/12 17:27). It would be useful to redo the tests, as many things changed recently on assignment. But if the results are ok, the cluster is big enough, and the regions are distributed, we should expect only one replay. Again, it's the average, not the worst case.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459444#comment-13459444 ] 

nkeywal commented on HBASE-5843:
--------------------------------

bq. Interesting. The sad part is we often find ourselves having to increase the ZK timeout in order to deal with Juliet GC pauses. Given that detection time dominates, perhaps we should put some effort into correcting that (multiple RS on a single box?)
Imho, multiple RS on the same box would put us in a dead end: it increases the number of tcp connections, adds workload on ZooKeeper, makes the balancer more complicated, ... We could also have operational issues (rolling upgrades, fixed tcp ports, ...).
The possible options I know of are:
- improving ZooKeeper to use an algorithm that takes variance into account: it's a common solution to get good failure detection while minimizing false positives. 20 years ago, it saved TCP from dying by congestion. There is ZOOKEEPER-702 about this. That's medium term (hopefully...), but it would be useful for HDFS HA also.
- using the new GC options available in JDK 1.7 (see http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html). That's short term and simple. The only issue: it was tried a few months ago (by Andrew Purtell IIRC), and it crashed the JVM. Still, it's something to look at, and maybe we should raise the bugs to Oracle if we find some.
- the offheap approach mentioned by Stack.

I don't think it's one or the other; we're likely to need all of them :-). Still, knowing where we stand with regard to JDK 1.7 is important imho.


bq. Yes, CPU definitely needs a diet. Probably start by eliminating a bunch of threads.
It's not directly MTTR, but I agree: we have far too many threads, and far too many thread pools. Not only is it bad for performance, it also makes analyzing performance complicated.

bq. Right, I think HBASE-6752 is a great idea, but it doesn't address serving reads more quickly. I'm wondering if there is more we can do to address that.
There is HBASE-6774 for the special case of "empty hlog" regions. It would be interesting to see how many regions are in this situation on different production clusters. There are so many ways to be in this situation... I would love to have a stat on "at a given point in time, what's the proportion of regions with a non-empty memstore". And improving the memstore flush policy would lead to improvements here as well, I think.
With HBASE-6752 we also serve time-ranged reads (if they're lucky on the range).
But yep, we don't cover all cases. Ideas welcome :-)
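
(A rough sketch of how the stat wished for above could be sampled from a client, assuming the 0.96-era ClusterStatus API; note RegionLoad only exposes MB granularity, so this is an approximation:)

{code}
import java.io.IOException;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Rough sketch: at a given point in time, what proportion of regions
// report a non-empty memstore? Regions with only a few KB in the
// memstore will be counted as empty because of the MB granularity.
public class MemstoreOccupancy {
  public static void main(String[] args) throws IOException {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    ClusterStatus status = admin.getClusterStatus();
    int total = 0, nonEmpty = 0;
    for (ServerName sn : status.getServers()) {
      for (RegionLoad rl : status.getLoad(sn).getRegionsLoad().values()) {
        total++;
        if (rl.getMemStoreSizeMB() > 0) nonEmpty++;
      }
    }
    System.out.printf("%d/%d regions with a non-empty memstore%n", nonEmpty, total);
    admin.close();
  }
}
{code}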


bq. Why do you say this with respect to locking? Is the performance not as good as you would expect? Or just haven't looked at it yet?
I was expecting much better performance, but I haven't looked at it enough.

bq. I've wondered why we don't do this. Do you see any implementation challenges with doing this? Maybe I'll look into it.
Well, it's close to the assignment part, so... :-) But it would be great if you could have a look at this, because with all the discussions around assignment, it's important to take these new use cases into account as well.

                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501838#comment-13501838 ] 

nkeywal commented on HBASE-5843:
--------------------------------

New scenario on datanode issue during a WAL write:

Scenario: with replication factor 2, start 2 DN & 1 RS and do a first put. Start a new DN, unplug the second one. Do another put, and measure the time of this second put.

HBase trunk / HDFS 1.1: ~5 minutes
HBase trunk / HDFS 2 branch: ~40 seconds
HBase trunk / HDFS 2.0.2-alpha-rc3: ~40 seconds


The time with HDFS 1.1 is spent in:
~66 seconds: waiting for the connection timeout (SocketTimeoutException: 66000 millis while waiting for the channel to be ready for read).
Then we have two nested retry loops:
- 6 retries: Failed recovery attempt #0 from primary datanode x.y.z.w:11011 -> NoRouteToHostException
- 10 sub-retries: Retrying connect to server: deadbox/x.y.z.w:11021. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

There are more or less 4 seconds between two sub-retries, so the total time is around:
66 + 6 * (~4 * 10) = ~300 seconds. That's our 5 minutes.

If we change HDFS code to have "RetryPolicies.TRY_ONCE_THEN_FAIL" vs. the default "RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)", the put succeeds in ~80 seconds.
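
(For reference, a sketch of the two policies as they are built with Hadoop's org.apache.hadoop.io.retry API; where the swap would go inside the DFSClient recovery path is left out:)

{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Sketch: the two retry policies compared above. The default retries
// each connect up to 10 times with a fixed 1s sleep; the alternative
// gives up after the first attempt.
public class RetryPolicyComparison {
  public static void main(String[] args) {
    RetryPolicy defaultPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(10, 1, TimeUnit.SECONDS);
    RetryPolicy failFast = RetryPolicies.TRY_ONCE_THEN_FAIL;
    System.out.println(defaultPolicy + " vs " + failFast);
  }
}
{code}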

Conclusion:
- the time with HDFS 2.x is in line with what we have for other scenarios (~40s), so it's acceptable today.
- the time with HDFS 1.x is much less satisfying (~5 minutes!), but it could easily be decreased to ~80s with an HDFS modification.

Some points to think about:
- Maybe we could decrease the timeout for the WAL: we're usually writing much less data than for a memstore flush, so having more aggressive settings for the WAL makes sense. There is a (bad) side effect: we may have more false positives, which could decrease performance and increase the workload when the cluster is globally unstable. So in the long term it makes sense, but maybe today is too early.
- While the namenode will consider the datanode as stale after 30s, we still continue trying it. Again, it makes sense to lower the global workload, but it's a little bit annoying... There could be optimizations if the datanode state were shared with DFSClients.
- Some cases could be handled faster: ConnectionRefused means the box is there but the port is not open, so there is no need to retry. And NoRouteToHostException could be considered critical enough to stop trying as well. Again, this is trading global workload vs. reactivity.

                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409663#comment-13409663 ] 

nkeywal commented on HBASE-5843:
--------------------------------

Some tests results:

I tested the following scenarios on a local machine: a pseudo-distributed cluster with ZooKeeper and HBase writing to a ram drive, no datanode nor namenode, with 2 region servers and one empty table with 10000 regions, 5K on each RS. Versions taken Monday the 2nd.

1) Clean stop of one RS; wait for all regions to become online again:
0.92: ~800 seconds
0.96: ~13 seconds

=> Huge improvement, hopefully from stuff like HBASE-5970 and HBASE-6109.

1.1) As above with 2Mb memory per server
Results as 1)

=> Results don't depend on any GC stuff (memory reported is around 200 Mb)


2) Kill -9 of a RS; wait for all regions to become online again:
0.92: 980s
0.96: ~13s

=> The 180s gap comes from HBASE-5844. For master, HBASE-5926 is not tested but should bring similar results.



3) Start of the cluster after a clean stop; wait for all regions to
become online.
0.92: ~1020s
0.94: ~1023s (tested once only)
0.96: ~31s

=> The benefit is visible at startup
=> This does not come from something implemented for 0.94



4) As 3) But with HBase on a local HD
0.92: ~1044s (tested once only)
0.96: ~28s (tested once only)

=> Similar results. It seems that HBase I/O was not, and is not becoming, the bottleneck.


5) As 1), but with 4 RS instead of 2
0.92) 406s
0.96) 6s

=> Twice as fast in both cases. Scales with the number of RS with both versions on this minimalistic test.



6) As 3), but with ZK on a local HD
Impossible to get anything consistent here; it's machine and test dependent.
The most credible result was similar to 2).
From the ZK mailing list and ZOOKEEPER-866, it seems that's what we should expect.



7) With 2 RS, Insert 20M simple puts; then kill -9 the second one. See how long it takes to have all the regions available.
0.92) 180s detection time, then hangs; twice out of 2 tests.
0.96) 14s (hangs once out of 3)

=> There's a bug ;-)
=> Test to be changed to get a real difference when we need to replay the wal.

                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "Gregory Chanan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418863#comment-13418863 ] 

Gregory Chanan commented on HBASE-5843:
---------------------------------------

Looks great so far, nkeywal.

Some questions:

{quote}
2) Kill -9 of a RS; wait for all regions to become online again:
0.92: 980s
0.96: ~13s
=> The 180s gap comes from HBASE-5844. For master, HBASE-5926 is not tested but should bring similar results.
{quote}

I'm confused as to what the 180s gap refers to.  I see 980 (test 2) - 800 (test 1) = 180, but that is against 0.92, which doesn't have HBASE-5970, right?  Could you clarify?

{quote}
3) Start of the cluster after a clean stop; wait for all regions to
become online.
0.92: ~1020s
0.94: ~1023s (tested once only)
0.96: ~31s
=> The benefit is visible at startup
=> This does not come from something implemented for 0.94
{quote}

Awesome.. We think this is also due to HBASE-5970 and HBASE-6109? (since I assume HBASE-5844 and HBASE-5926 do not apply in this case).

{quote}
7) With 2 RS, Insert 20M simple puts; then kill -9 the second one. See how long it takes to have all the regions available.
0.92) 180s detection time, then hangs; twice out of 2 tests.
0.96) 14s (hangs once out of 3)
=> There's a bug 
{quote}
Has a JIRA been filed?

{quote}
Test to be changed to get a real difference when we need to replay the wal.
{quote}
Could you clarify what you mean here?

                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418919#comment-13418919 ] 

nkeywal commented on HBASE-5843:
--------------------------------

bq. I'm confused as to what the 180s gap refers to. I see 980 (test 2) - 800 (test 1) = 180, but that is against 0.92, which doesn't have HBASE-5970, right? Could you clarify?
Yes, it's because with a clean stop, the RS unregisters itself in ZK, so the recovery starts immediately. With a kill -9, the RS remains registered in ZK. So if you don't have HBASE-5844 or HBASE-5926, you wait for the ZK timeout.
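
(The mechanism, for reference: the RS registration is an ephemeral znode, which ZK deletes immediately when the session is closed cleanly, but only after the session timeout following a kill -9. A minimal ZooKeeper sketch; the quorum address and znode path are illustrative:)

{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: an ephemeral znode like the RS registration. After a clean
// close() the znode vanishes at once; after a kill -9 it only goes away
// once the session timeout expires.
public class EphemeralRegistration {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) { /* ignore events */ }
    });
    zk.create("/hbase/rs/example-server", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    zk.close(); // clean stop: the ephemeral znode is removed immediately
  }
}
{code}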

bq. Awesome.. We think this is also due to HBASE-5970 and HBASE-6109? 
Yes.
 
bq. Has a JIRA been filed?
Not yet. I'm writing specific unit tests for this; I found issues that I have not yet fully analyzed, and I need to create the JIRAs. Also, maybe my test was not good for this part: as I was doing the test without a datanode, it could be that the recovery was not working for this reason (I wonder if the sync works with the local file system, for example).


bq. Test to be changed to get a real difference when we need to replay the wal.
bq. Could you clarify what you mean here?
It does not last long enough, so I won't be able to see much of a difference even if there is one. So I need to redo the work with a real datanode, check that it recovers, then check that I measure something meaningful.
I will also redo the first tests with a DN to see if there is still a gap.



                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454911#comment-13454911 ] 

nkeywal commented on HBASE-5843:
--------------------------------

Test with meta:

On a real cluster, 3 nodes.
    dfs.replication = 2
    local HD.
    Start with 2 DN and 2 RS. Create a table with 100 regions in the second one. The first holds meta & root.
    Start another box with a DN and a RS. This box is empty (no regions, no blocks).
    Unplug the box with meta & root.
    Try to create a table.

=> The time taken is the recovery time of the box holding meta. No bad surprise. It means as well that with the default zookeeper timeout you're losing the cluster for 3 minutes if your meta regionserver dies. HBASE-6772, HBASE-6773 and HBASE-6774 would help to increase meta failure resiliency.

                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451096#comment-13451096 ] 

stack commented on HBASE-5843:
------------------------------

Nicolas, the above should be put on the dev list.  It's great stuff.  Too good to be buried out here as a comment on an issue with such an unsexy subject.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476830#comment-13476830 ] 

nkeywal commented on HBASE-5843:
--------------------------------

bq. Not sure what you mean here. Are the rows the same or not? Are there just more flushes in the 10M case?
Yes, it's exactly the same rows for 1M puts and 10M puts.
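
(The "maximum number of log files before a flush" I referred to is the hbase.regionserver.maxlogs bound: once a regionserver has more WAL files than that, it forces flushes so the oldest logs can be archived. A sketch of overriding it, the value below being purely illustrative:)

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MaxLogsSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Past this WAL count, flushes are forced so old logs can be
    // archived; 32 is the usual default, 16 here is just an example.
    conf.setInt("hbase.regionserver.maxlogs", 16);
  }
}
{code}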
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504937#comment-13504937 ] 

nkeywal commented on HBASE-5843:
--------------------------------

Some more tests:

scenario: 2 RS. Create a table with 3000 regions.
Start a third RS. Kill -15 the second one: there are 1500 regions to reassign.

I've made multiple measurements; they are all listed below, in seconds.
0.96 trunk is patched with HBASE-7220 (no metrics at all).

Creating a table with 3000 regions:
0.92: 261s; 260s
0.94.0: 260s; 260s
0.94.1: 261s; 260s
0.94 trunk: 292s; 281s; 282s
0.96 trunk: 173s; 178s

Reassign after the kill:
0.92: 107s; 110s
0.94.0: 105s; 105s
0.94.1: 107s; 107s
0.94 trunk: 122s; 105s; 116s
0.96 trunk: 50s; 50s

Same test, but with ZK on a separate node, 1Gb network.
Reassign after the kill:
0.96 trunk: 101s


I need to do more tests, but it seems to match the profiling I've done; the time is now spent in ZK communications.
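
For reproducibility, a sketch of the table-creation half of the scenario (table and family names are invented; the kill -15 and the timing were done by hand on the cluster):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("mttr_test"); // name invented
    desc.addFamily(new HColumnDescriptor("f"));
    // Pre-split into 3000 regions across the key range; the "creating a
    // table" timings above measure exactly this call.
    admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("ffffffff"), 3000);
    admin.close();
  }
}
{code}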

                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "nkeywal (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453296#comment-13453296 ] 

nkeywal commented on HBASE-5843:
--------------------------------

The issue when reading, even with HDFS-3703, comes from something new in HDFS 2.0. It seems that blocks being written are no longer in the 'main' block list, so the initial fix was not working for this block. It shows that if you hit the error on a single log file out of 8, you're still paying a huge price in terms of added delay...


                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "Gregory Chanan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459251#comment-13459251 ] 

Gregory Chanan commented on HBASE-5843:
---------------------------------------

Again, great work Nicolas.  Some questions/comments:

bq. It means as well that the detection time dominates the other parameters when everything goes well.
bq. Detection time will become more and more important.
Interesting.  The sad part is we often find ourselves having to *increase* the ZK timeout in order to deal with Juliet GC pauses.  Given that detection time dominates, perhaps we should put some effort into correcting that (multiple RS on a single box?)

bq. Replaying should be looked at (more in terms of lock than raw performances).
Why do you say this with respect to locking?  Is the performance not as good as you would expect?  Or just haven't looked at it yet?

bq. An improvement would be to reassign the region in parallel of the split
I've wondered why we don't do this.  Do you see any implementation challenges with doing this?  Maybe I'll look into it.
                

[jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover

Posted by "Gregory Chanan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476590#comment-13476590 ] 

Gregory Chanan commented on HBASE-5843:
---------------------------------------

bq. No / ~Yes. There's a maximum number of log files before a flush however.

Not sure what you mean here.  Are the rows the same or not?  Are there just more flushes in the 10M case?

Thanks for clarifying.
                