Posted to issues@hbase.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2013/10/18 15:41:42 UTC
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

    [ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799098#comment-13799098 ] 

Steve Loughran commented on HBASE-9802:
---------------------------------------

This sounds interesting and potentially very useful beyond just HBase. Hadoop YARN applications are the obvious target, as they need to be written to expect failure, and if they don't get tested, well, they won't work. I ended up doing some basics of this with ssh and reboot operations, but I really wanted something that could talk to an open WRT base station and actually generate real network partitions, rather than just simulations. 

# Accumulo has something similar, though I've not seen it
# Would it be possible to make this more generic? Even if it starts off in HBase, it would be good to have the option of branching it off into its own project, and to allow people downstream to use it even earlier.

I'd propose making the core test framework a module that could be picked up and used downstream, precisely to get that cross-application testing.

> A new failover test framework for HBase
> ---------------------------------------
>
>                 Key: HBASE-9802
>                 URL: https://issues.apache.org/jira/browse/HBASE-9802
>             Project: HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.94.3
>            Reporter: chendihao
>            Priority: Minor
>
> Currently HBase uses ChaosMonkey for integration tests and fault injection. It restarts regionservers, forces the balancer to run, and performs other actions randomly and periodically. However, we need a more extensible and full-featured framework for our failover tests, and we find that ChaosMonkey can't meet our needs because it has the following drawbacks.
> 1) Only process-level actions can be simulated; machine-level, hardware-level, and network-level actions are not supported.
> 2) There is no data validation before and after the test, so fatal bugs, such as those that cause data inconsistency, may be overlooked.
> 3) When a failure occurs, we can't reproduce the problem, and it is hard to figure out the cause.
> Therefore, we have developed a new framework to satisfy the needs of failover testing. We extended ChaosMonkey and implemented functions to validate data and to replay failed actions. Here are the features we added.
> 1) A Policy/Task/Action abstraction. Separating Task from Policy and Action makes it easier to manage and replay a set of actions.
> 2) Configurable actions. We have implemented some actions that cause machine failure, and they expose the same interface as the original actions.
> 3) We validate data consistency before and after the failover test to ensure availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of the table.
> 5) The set of actions that caused a test failure can be replayed, and this reproducibility helps in fixing the bugs that are exposed.
> Our team has developed this framework and has been running it for a while. Some bugs were exposed and fixed by running this test framework. Moreover, we have a monitor program that shows the progress of the failover test and makes sure our cluster is as stable as we want. We are now trying to make it more general and will open source it later.
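
For illustration only, here is a minimal Java sketch of how a Task/Action split with record-and-replay and a before/after consistency check might look. Every name in it (FailoverSketch, Action, Task, fakeAction, fromRecord, runAndValidate, and the action names) is hypothetical; none is taken from HBase, ChaosMonkey, or the framework described in this issue, and the actions only append to an in-memory log rather than touch a cluster. The Policy layer, which would decide when and how often tasks run, is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

public class FailoverSketch {

    /** One fault-injection step. Hypothetical interface, not the HBase ChaosMonkey API. */
    interface Action {
        String name();
        void perform(List<String> eventLog);
    }

    /** Builds a stand-in action that only records that it ran. */
    static Action fakeAction(final String name) {
        return new Action() {
            @Override public String name() { return name; }
            @Override public void perform(List<String> eventLog) { eventLog.add(name); }
        };
    }

    /** An ordered set of actions that is run, recorded, and replayed as a unit. */
    static final class Task {
        private final List<Action> actions;

        Task(List<Action> actions) { this.actions = new ArrayList<>(actions); }

        /** Performs each action in order and returns the names performed, for later replay. */
        List<String> run(List<String> eventLog) {
            List<String> record = new ArrayList<>();
            for (Action action : actions) {
                action.perform(eventLog);
                record.add(action.name());
            }
            return record;
        }

        /** Rebuilds a task from a recorded list of action names. */
        static Task fromRecord(List<String> record, Map<String, Action> registry) {
            List<Action> actions = new ArrayList<>();
            for (String name : record) {
                Action action = registry.get(name);
                if (action == null) {
                    throw new IllegalArgumentException("unknown action: " + name);
                }
                actions.add(action);
            }
            return new Task(actions);
        }
    }

    /** Runs a task and reports whether a data checksum is unchanged afterwards. */
    static boolean runAndValidate(Task task, LongSupplier checksum, List<String> eventLog) {
        long before = checksum.getAsLong();
        task.run(eventLog);
        return before == checksum.getAsLong();
    }

    public static void main(String[] args) {
        Map<String, Action> registry = new HashMap<>();
        registry.put("restart-regionserver", fakeAction("restart-regionserver"));
        registry.put("force-balancer", fakeAction("force-balancer"));

        Task task = new Task(Arrays.asList(
                registry.get("restart-regionserver"), registry.get("force-balancer")));
        List<String> firstLog = new ArrayList<>();
        List<String> record = task.run(firstLog);

        // Replaying the recorded names should drive the same actions in the same order.
        List<String> replayLog = new ArrayList<>();
        Task.fromRecord(record, registry).run(replayLog);
        System.out.println(firstLog.equals(replayLog)); // prints "true"
    }
}
```

In this sketch the record is just a list of action names, which assumes that actions can be looked up again by name and take no per-run parameters; a real framework would presumably also need to record targets (which host, which process) and timing for a replay to be faithful.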



--
This message was sent by Atlassian JIRA
(v6.1#6144)