
Posted to issues@hbase.apache.org by "Thomas Pan (Created) (JIRA)" <ji...@apache.org> on 2011/12/01 04:59:39 UTC

[jira] [Created] (HBASE-4925) Collect test cases for hadoop/hbase cluster

Collect test cases for hadoop/hbase cluster
-------------------------------------------

                 Key: HBASE-4925
                 URL: https://issues.apache.org/jira/browse/HBASE-4925
             Project: HBase
          Issue Type: Brainstorming
          Components: test
            Reporter: Thomas Pan


This entry is used to collect all the useful test cases for verifying a hadoop/hbase cluster. It follows up on yesterday's hack day at Salesforce. Hopefully the information will be useful for the whole community.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209115#comment-13209115 ] 

gaojinchao commented on HBASE-4925:
-----------------------------------

@ Thomas

Is your framework available? We have finished automating part of the cases, but I find it is not stable.
                

[jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster

Posted by "Thomas Pan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209193#comment-13209193 ] 

Thomas Pan commented on HBASE-4925:
-----------------------------------

gaojinchao, I am not the person working on the framework. Stack should know better. As for stability, any cluster could go down, regardless. So we need to set up the same cluster in different data centers; in case one is down, another is still running. The beauty of HBase is that the underlying data is still available. As long as the downtime is controllable, it should be fine.
                

[jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster

Posted by "Thomas Pan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160635#comment-13160635 ] 

Thomas Pan commented on HBASE-4925:
-----------------------------------

Here is the list of fault injection test cases that we've collected:
1. Kill -9 one region server and kill -9 the region server that serves the .META. table 
2. While BES is writing data to HBase table, kill -9 the region server that holds the .META. table 
3. Kill -9 the region server that serves the .META. table. Then, kill -9 the region server that serves the -ROOT- table. [Thomas: Is it the case in our environment?] 
4. A large number of region servers get killed. After restoration, there is no data loss. 
5. No job impact while shifting from the primary HBase master to the secondary HBase master. 
6. Shift from the primary HBase master to the secondary HBase master after multiple region servers fail. 
7. Shift from the primary HBase master to the secondary HBase master after new region servers are added. 
8. Repeatedly stop and restart the primary HBase master. There should be no major impact as the secondary HBase master kicks in automatically. 
9. Shift from the primary HBase master to the secondary HBase master while a table with 3600 regions is being created. 
10. Disable network access for the node running the region server that serves the .META. table 
11. Disable network access for the node running the primary HBase master 
12. Disable network access for the node running the secondary HBase master 
13. Trigger short-lived network interruption for the node running the region server that serves the .META. table 
14. Trigger short-lived network interruption for the node running the primary HBase master 
15. Trigger short-lived network interruption for the node running the secondary HBase master 
16. While BES is writing to a table heavily with high CPU usage in the cluster. 
17. Restart one RS with high CPU usage in the cluster. 
18. Offline data nodes with high CPU usage in the cluster. 
19. While BES is writing to a table heavily with high memory usage in the cluster. 
20. Restart one RS with high memory usage in the cluster. 
21. Offline data nodes with high memory usage in the cluster. 
22. With no load in the cluster, test failover of the primary NN to the secondary NN 
23. With jobs running in the cluster, test failover of the primary NN to the secondary NN 
24. Repeatedly stop and restart the primary NN to make sure that the NN failover works fine 
25. Kill -9 the primary zookeeper. The failover to the second NN should be in time with no job failure. 
26. Kill -9 the primary zookeeper and the primary NN. The cluster should quickly fail over to the secondary ZK and NN 
27. Restart the node that holds the primary NN 
28. Disable network access for the node running the primary NN 
29. Trigger short-lived network interruption for the node running the primary NN 
30. Disable network access for the node running the primary ZK 
31. Trigger short-lived network interruption for the node running the primary ZK 
32. Disable network access for the node running ZK in follower state 
33. Trigger short-lived network interruption for the node running ZK in follower state 
34. Offline multiple data nodes at once. Keep them offline for a while. 
35. Offline multiple data nodes at once. Keep them offline for a while. Put them back at once. 
37. Offline multiple data nodes at once. Put them back at once, instantly. 
38. Offline a data node at once. Keep it offline for a while. 
39. Offline a data node at once. Keep it offline for a while. Put it back at once. 
40. Offline a data node at once. Put it back at once, instantly. 
41. Hard disk failure in the primary NN triggers NN failover. 
42. The directory dfs.data.dir on data node gets corrupted 
43. Corrupted dfs.name.dir on the primary NN gets detected and triggers NN failover. 
44. Corrupted dfs.name.dir on the secondary NN gets detected. 
45. A data node runs out of disk space. 
46. Under heavy IO on data nodes, BES writes to a table heavily. 
47. Under heavy IO on data nodes, offline multiple data nodes.
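
One way cases like the ones above could be organized for automation is sketched below, in Python. It is only an illustration, not the framework discussed in this issue: FaultScenario, run_scenarios, and kill_dash_9 are hypothetical names, and the demo kills a sleeping child process as a stand-in for a region server rather than touching a real cluster. It assumes a POSIX host.

```python
import os
import signal
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


def _noop() -> None:
    """Default restore step: nothing to bring back."""


@dataclass
class FaultScenario:
    """One fault-injection case: inject a fault, check an invariant, restore."""
    name: str
    inject: Callable[[], None]            # e.g. kill -9 a region server
    verify: Callable[[], bool]            # e.g. "no data loss" check
    restore: Callable[[], None] = _noop   # e.g. restart the killed process
    repeat: int = 1                       # >1 for "repeatedly stop and restart" cases


def run_scenarios(scenarios: List[FaultScenario],
                  settle_seconds: float = 0.0) -> Dict[str, bool]:
    """Run each scenario `repeat` times; a scenario passes only if every round passes."""
    results = {}
    for scenario in scenarios:
        passed = True
        for _ in range(scenario.repeat):
            scenario.inject()
            time.sleep(settle_seconds)    # give failover a chance to complete
            passed = scenario.verify() and passed
            scenario.restore()
        results[scenario.name] = passed
    return results


def kill_dash_9(pid: int) -> None:
    """Equivalent of `kill -9 <pid>` on a POSIX host."""
    os.kill(pid, signal.SIGKILL)


if __name__ == "__main__":
    # Stand-in for a region server: a child process that just sleeps.
    victim = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    scenario = FaultScenario(
        name="kill -9 one region server (stand-in process)",
        inject=lambda: kill_dash_9(victim.pid),
        # A process terminated by SIGKILL reports return code -9.
        verify=lambda: victim.wait(timeout=10) == -signal.SIGKILL,
    )
    print(run_scenarios([scenario]))
```

In a real run, the inject, verify, and restore steps would be replaced by whatever the cluster tooling provides (for example, locating the region server that serves .META. and checking row counts afterwards); those parts are deliberately left out here.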
                

[jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster

Posted by "Thomas Pan (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161466#comment-13161466 ] 

Thomas Pan commented on HBASE-4925:
-----------------------------------


Once the framework is available, we will write them. Regardless, we plan to carry out these tests mainly to achieve two goals:

1. To build up experience in handling various outages before production launch, which we plan to share with the community once we have more details.
2. To reveal more issues in the code base so that the community can fix them. Last time, Todd found more HBase bugs while helping us recover from one big outage.

                

[jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161460#comment-13161460 ] 

stack commented on HBASE-4925:
------------------------------

That is a pretty nice list, Thomas. Thanks. Do you have a means of running these scenarios now, or are you waiting on a framework in which you/we could write them?
                

[jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210670#comment-13210670 ] 

stack commented on HBASE-4925:
------------------------------

@Gaojinchao What did you finish? Do you have automation that runs some of Thomas's tests?
                