Posted to common-issues@hadoop.apache.org by "Todd Lipcon (Created) (JIRA)" <ji...@apache.org> on 2012/03/29 23:57:25 UTC

[jira] [Created] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Auto HA: Refactor tests and add stress tests
--------------------------------------------

                 Key: HADOOP-8228
                 URL: https://issues.apache.org/jira/browse/HADOOP-8228
             Project: Hadoop Common
          Issue Type: Test
          Components: auto-failover, ha, test
    Affects Versions: Auto Failover (HDFS-3042)
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon


It's important that the ZKFailoverController be robust and not contain race conditions, etc. One strategy to find potential races is to add stress tests which exercise the code as fast as possible. This JIRA is to implement some test cases of this style.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Posted by "Aaron T. Myers (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242643#comment-13242643 ] 

Aaron T. Myers commented on HADOOP-8228:
----------------------------------------

This is some great cleanup, Todd. The new tests and testing API look good.

One question: are you positive that the ordering of the two @After methods either doesn't matter, or is guaranteed to happen in the right order?

One comment: maybe use a deterministic random seed for the Random instances you're using? Or at least log the amount of time that the test is sleeping for and what it's throwing? I realize that the test won't be deterministic regardless, but it will be really tough to try to reproduce test failures caused by a particular ZK disconnect pattern or health check failure pattern if we have no idea what that pattern was.

+1 once those are addressed
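
The seed-logging idea raised in the comment above can be sketched as follows. This is a minimal illustration only, not code from the attached patch: the class SeededFaultSchedule, its Step type, and the 50 ms bound are all hypothetical names and values chosen for the example. Logging the seed makes the generated fault pattern reproducible, while the thread interleaving stays nondeterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: generates a fault pattern that can be regenerated from a logged seed. */
final class SeededFaultSchedule {

  /** One step of the pattern: sleep for a while, then optionally inject an error. */
  static final class Step {
    final int sleepMillis;
    final boolean throwError;

    Step(int sleepMillis, boolean throwError) {
      this.sleepMillis = sleepMillis;
      this.throwError = throwError;
    }

    @Override
    public String toString() {
      return "sleep " + sleepMillis + " ms, throwError=" + throwError;
    }
  }

  /** The same seed always yields the same sequence of steps. */
  static List<Step> generate(long seed, int steps) {
    Random random = new Random(seed);
    List<Step> schedule = new ArrayList<>();
    for (int i = 0; i < steps; i++) {
      // 50 ms upper bound is an arbitrary value for this sketch.
      schedule.add(new Step(random.nextInt(50), random.nextBoolean()));
    }
    return schedule;
  }

  public static void main(String[] args) {
    // Reuse a seed passed on the command line, otherwise pick a fresh one.
    long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
    // Log the seed first so a failing run can be replayed with the same pattern.
    System.out.println("fault schedule seed = " + seed);
    for (Step step : generate(seed, 5)) {
      System.out.println(step);
    }
  }
}
```

Passing the logged seed back in as the first argument regenerates the same sleep and error pattern; it does not reproduce how the threads interleaved.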
                


        

[jira] [Commented] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Posted by "Todd Lipcon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242677#comment-13242677 ] 

Todd Lipcon commented on HADOOP-8228:
-------------------------------------

bq. One question: are you positive that the ordering of the two @After methods either doesn't matter, or is guaranteed to happen in the right order?

The order of the two @After methods is nondeterministic. But in this case, it's only important that our @After method runs before the tearDown of the superclass (ClientBase), and JUnit does guarantee that ordering.

bq. One comment: maybe use a deterministic random seed for the Random instances you're using? Or at least log the amount of time that the test is sleeping for and what it's throwing?

Good point. I added additional logging for when it throws exceptions and for when it expires sessions. I don't think the deterministic seed helps things, since the interleaving is still nondeterministic (that's part of the value of these tests :) ).
                


        

[jira] [Updated] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Posted by "Todd Lipcon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-8228:
--------------------------------

    Attachment: hadoop-8228.txt

New rev addresses the above.

I also noticed that testRandomHealthAndDisconnects was having only one of the two services randomly throw errors. I fixed it so that both now throw errors. The test still passes, though I'll give it a longer run before committing.
                


        

[jira] [Updated] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Posted by "Todd Lipcon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-8228:
--------------------------------

    Attachment: hadoop-8228.txt

Slight fix to the fencing setup in the dummy HA service. Fencing wasn't properly releasing the DummySharedResource, so some of the earlier tests started failing. The fix adds a FenceMethod which marks the DummySharedResource as released.
                


        

[jira] [Updated] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Posted by "Todd Lipcon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-8228:
--------------------------------

    Attachment: hadoop-8228.txt

This test-only patch does the following:
- refactors much of the TestZKFailoverController code into a "MiniZKFCCluster" class, which allows tests to start and stop a pair of electable dummy services. There is no semantic change to TestZKFailoverController; the patch is large only because the tests move to the new API.
- adds TestZKFailoverControllerStress, which contains several new stress tests for failover, session expiration, etc.
- adds a DummySharedResource which keeps track of its "owner". When one of the dummy services becomes active, it takes control of the resource. When it goes standby, it releases control. If two services try to take the resource at the same time, it will throw an exception and record the violation. The MiniZKFCCluster checks that there were no violations in its shutdown method.
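
The ownership-tracking idea in the last bullet can be sketched roughly as below. This is an illustration under assumptions, not the DummySharedResource from the attached patch: the class name SharedResourceSketch, its method names, and the use of String owner names are all hypothetical.

```java
/** Hypothetical sketch of a resource that records attempts at double ownership. */
final class SharedResourceSketch {
  private String holder;     // current owner, or null when the resource is free
  private int violations;    // number of detected double-ownership attempts

  /** Called when a service becomes active. */
  synchronized void take(String newHolder) {
    if (holder != null && !holder.equals(newHolder)) {
      // Record the violation before throwing so a later check still sees it.
      violations++;
      throw new IllegalStateException(
          "resource held by " + holder + " but requested by " + newHolder);
    }
    holder = newHolder;
  }

  /** Called when a service goes standby; only the current owner can release. */
  synchronized void release(String oldHolder) {
    if (oldHolder.equals(holder)) {
      holder = null;
    }
  }

  /** Intended to be called from a cluster shutdown or teardown method. */
  synchronized void assertNoViolations() {
    if (violations > 0) {
      throw new AssertionError(violations + " ownership violation(s) recorded");
    }
  }

  public static void main(String[] args) {
    SharedResourceSketch resource = new SharedResourceSketch();
    resource.take("svc1");      // svc1 becomes active
    resource.release("svc1");   // svc1 goes standby
    resource.take("svc2");      // svc2 becomes active
    resource.assertNoViolations();
    try {
      resource.take("svc1");    // svc1 tries to take over while svc2 still holds it
    } catch (IllegalStateException e) {
      System.out.println("violation recorded: " + e.getMessage());
    }
  }
}
```

Recording the violation in addition to throwing matters because the exception may be raised on a background thread and swallowed there; the counter lets the shutdown check fail the test regardless.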
                


        

[jira] [Resolved] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Posted by "Todd Lipcon (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HADOOP-8228.
---------------------------------

       Resolution: Fixed
    Fix Version/s: Auto Failover (HDFS-3042)
     Hadoop Flags: Reviewed

Committed to HDFS-3042 branch. Thanks for reviewing, Aaron. I also ran the tests here a couple of times with STRESS_RUNTIME_SECS bumped to 120s before committing.
                


        

[jira] [Updated] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Posted by "Todd Lipcon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-8228:
--------------------------------

    Attachment: hadoop-8228.txt

New rev includes code which covers the retry-on-disconnect mechanisms in the elector, as well as many of the error handling paths. While running a stress test, it has the ZK server disconnect all of its clients every 50ms. This causes outstanding ZK calls to get a {{Disconnected}} error and have to handle it.

I verified with Clover that this covers the retry code paths and various failure cases while writing/reading the breadcrumb node, etc.
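
For readers unfamiliar with the pattern, a generic retry-on-disconnect loop might look like the sketch below. It is only an assumption-laden illustration: the elector's actual retry mechanism is not shown in this thread, and RetryOnDisconnectSketch, DisconnectedException, and callWithRetries are hypothetical names, not ZooKeeper or Hadoop APIs.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of retrying an operation that fails with a disconnect error. */
final class RetryOnDisconnectSketch {

  /** Stand-in for a connection-loss error; not a ZooKeeper class. */
  static final class DisconnectedException extends Exception {
    DisconnectedException(String message) {
      super(message);
    }
  }

  /** Runs op, retrying up to maxRetries times after a disconnect; other errors propagate at once. */
  static <T> T callWithRetries(Callable<T> op, int maxRetries) throws Exception {
    int retriesUsed = 0;
    while (true) {
      try {
        return op.call();
      } catch (DisconnectedException e) {
        if (retriesUsed >= maxRetries) {
          throw e;  // retry budget exhausted, surface the last disconnect
        }
        retriesUsed++;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    final int[] calls = {0};
    // Simulated operation: disconnects on the first two calls, then succeeds.
    String result = callWithRetries(() -> {
      calls[0]++;
      if (calls[0] <= 2) {
        throw new DisconnectedException("simulated disconnect on call " + calls[0]);
      }
      return "ok";
    }, 3);
    System.out.println(result + " after " + calls[0] + " calls");
  }
}
```

A stress test that disconnects clients very frequently is a way to drive code through both branches of such a loop: the retry path and the path where the retry budget runs out.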
                


        

[jira] [Commented] (HADOOP-8228) Auto HA: Refactor tests and add stress tests

Posted by "Aaron T. Myers (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242682#comment-13242682 ] 

Aaron T. Myers commented on HADOOP-8228:
----------------------------------------

Thanks for addressing my comments, Todd.

+1, the latest patch looks good to me.
                
