Posted to dev@zookeeper.apache.org by "Marshall McMullen (JIRA)" <ji...@apache.org> on 2011/07/13 18:19:00 UTC

[jira] [Created] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Multiop submitted to non-leader always fails due to timeout
-----------------------------------------------------------

                 Key: ZOOKEEPER-1124
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1124
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.4.0
         Environment: all
            Reporter: Marshall McMullen
            Priority: Critical
             Fix For: 3.4.0


The new Multiop support added under ZOOKEEPER-965 fails every single time if the multiop is submitted to a non-leader in quorum mode. In standalone mode it always works; the bug only shows up in quorum mode (with 2 or more nodes). After 12 hours of debugging (*sigh*), it turns out to be a really simple fix: a couple of case statements are missing from FollowerRequestProcessor.java and ObserverRequestProcessor.java, so a multiop is never forwarded to the leader for commit. I've attached a patch that fixes this problem.
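
For readers unfamiliar with the request pipeline, here is a minimal, hypothetical Java model of the routing switch described above. It is not the actual FollowerRequestProcessor source: the class name, the op-code values and the "forwarded" list are invented stand-ins (the real code works with ZooDefs.OpCode and hands writes to the Follower). What it illustrates is that writes are forwarded by an explicit list of case labels, so any write type missing from that list is quietly never sent to the leader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a follower's routing switch. It is NOT the
// real FollowerRequestProcessor: the constants stand in for ZooDefs.OpCode and
// the "forwarded" list stands in for sending the request to the leader.
class FollowerRoutingSketch {
    // Illustrative op-code values chosen for this sketch.
    static final int CREATE = 1;
    static final int DELETE = 2;
    static final int GET_DATA = 4;
    static final int SET_DATA = 5;
    static final int MULTI = 14;

    // Requests handed to the leader for a quorum commit.
    final List<Integer> forwarded = new ArrayList<>();

    void route(int type) {
        // Assumption: the request has already been queued locally, where a
        // write waits for the leader's commit before the client is answered.
        switch (type) {
            case CREATE:
            case DELETE:
            case SET_DATA:
            case MULTI: // the kind of case label the patch adds
                forwarded.add(type);
                break;
            default:
                // Reads are answered locally. A write type missing from the
                // labels above also ends up here: it is never sent to the
                // leader, so its commit never arrives and the client times out.
                break;
        }
    }

    public static void main(String[] args) {
        FollowerRoutingSketch follower = new FollowerRoutingSketch();
        follower.route(MULTI);
        follower.route(GET_DATA);
        System.out.println(follower.forwarded); // prints [14]
    }
}
```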

It's probably worth noting that ZOOKEEPER-965 has already been committed to trunk. But this is a fatal flaw that prevents multiop support from working properly, so the fix needs to be committed to 3.4.0 as well. Is there a way to link these two issues together?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Marshall McMullen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marshall McMullen updated ZOOKEEPER-1124:
-----------------------------------------

    Attachment:     (was: multi-non-observer.patch)


[jira] [Assigned] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Patrick Hunt (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt reassigned ZOOKEEPER-1124:
---------------------------------------

    Assignee: Marshall McMullen
    

[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Camille Fournier (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064727#comment-13064727 ] 

Camille Fournier commented on ZOOKEEPER-1124:
---------------------------------------------

The QuorumBase and associated tests are probably what you want, Marshall. Those tests can be used to set up a quorum of ZKs, and if you look at the tests that extend it, they will show you how to figure out whether you are connected to a leader or a follower.


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064830#comment-13064830 ] 

Mahadev konar commented on ZOOKEEPER-1124:
------------------------------------------

Marshall, take a look at QuorumTest to see how to start a real quorum of servers.


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064683#comment-13064683 ] 

Ted Dunning commented on ZOOKEEPER-1124:
----------------------------------------

Marshall,

This fix is clearly important. Do you have any tests?

The role of these tests is not just to verify this fix, but also to provide a prototype for any later implementors of new operations.


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Marshall McMullen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064710#comment-13064710 ] 

Marshall McMullen commented on ZOOKEEPER-1124:
----------------------------------------------

I agree with Ted's comment that the default here is to fail (drop the request on the floor) when new op types are added. I thought of throwing an exception when a type isn't in the switch, so it would fail loudly at runtime instead of silently as it does now. But perhaps the better answer is, instead of enumerating all op types, to pass them all through to the appropriate Observer/Follower and let it throw an exception if the type isn't supported. Alternatively, if there are specific types an observer/follower doesn't support, handle those explicitly, then assume the rest are valid and pass them along.
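
The last option above (list only the exceptions, forward everything else) can be sketched as follows. This is a hypothetical model, not ZooKeeper code: DefaultForwardRouter, the Route enum, the op-code values and FOLLOWER_UNSUPPORTED are all invented for illustration. Only locally served reads and explicitly refused types are enumerated; any other type, including one added later, defaults to the leader instead of being dropped.

```java
import java.util.Set;

// Hypothetical sketch of "pass everything through unless listed".
// Names and op-code values are illustrative, not ZooKeeper's.
class DefaultForwardRouter {
    enum Route { LOCAL, LEADER, REJECT }

    static final int EXISTS = 3;
    static final int GET_DATA = 4;
    static final int GET_CHILDREN = 8;
    static final int FOLLOWER_UNSUPPORTED = 99; // hypothetical op a follower refuses

    // Only the exceptions are enumerated; a newly added op needs no entry.
    private static final Set<Integer> LOCAL_READS = Set.of(EXISTS, GET_DATA, GET_CHILDREN);
    private static final Set<Integer> REJECTED = Set.of(FOLLOWER_UNSUPPORTED);

    static Route route(int type) {
        if (LOCAL_READS.contains(type)) {
            return Route.LOCAL;   // answered by the follower itself
        }
        if (REJECTED.contains(type)) {
            return Route.REJECT;  // caller should fail loudly, not drop it
        }
        return Route.LEADER;      // default: let the leader accept or reject it
    }

    public static void main(String[] args) {
        System.out.println(route(GET_DATA));             // prints LOCAL
        System.out.println(route(14));                   // prints LEADER
        System.out.println(route(FOLLOWER_UNSUPPORTED)); // prints REJECT
    }
}
```

The trade-off in this model is that a new read-only op nobody adds to LOCAL_READS gets routed to the leader rather than served locally: slower, but visible, instead of a silent timeout.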


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Marshall McMullen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064707#comment-13064707 ] 

Marshall McMullen commented on ZOOKEEPER-1124:
----------------------------------------------

I wrote unit tests in our own local test harness to verify the fix. I also tested manually to verify the fix.

I wasn't able to figure out how to write a proper ZooKeeper unit test in either the C or Java test suites, as I couldn't find a single example where we spawn multiple ZooKeeper servers in quorum mode. Is there an example of this somewhere that I've missed? I see lots of examples of starting up one ZooKeeper server, and various mock instances of servers, but nowhere that I could find where we start multiple real servers.


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065043#comment-13065043 ] 

Hadoop QA commented on ZOOKEEPER-1124:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12486402/ZOOKEEPER-1124.patch
  against trunk revision 1146025.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/394//console

This message is automatically generated.


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064689#comment-13064689 ] 

Hadoop QA commented on ZOOKEEPER-1124:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12486329/multi-non-observer.patch
  against trunk revision 1146025.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/390//console

This message is automatically generated.


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064684#comment-13064684 ] 

Ted Dunning commented on ZOOKEEPER-1124:
----------------------------------------

All,

Is there a way to restructure the code so that naive implementors don't run into this situation? As it stands, the code is default-fail; it would be nice to make it default-succeed when new ops are added.

Or is the addition of new ops rare enough that this doesn't matter?
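
One possible answer to this question, offered as a sketch under stated assumptions: move the read/write decision onto an enum, so that every op must declare whether it needs a quorum commit at the point where it is defined. A new constant without that constructor argument does not compile, which removes the separate switch that can fall out of sync. ZooKeeper's protocol actually uses integer op codes (ZooDefs.OpCode), so OpType and FollowerRouter here are hypothetical names, and this is a design sketch rather than a drop-in patch.

```java
// Hypothetical sketch: each op declares, where it is defined, whether it
// needs a quorum commit. Adding a constant without the boolean argument is a
// compile error, so a follower cannot silently "forget" a new op.
enum OpType {
    CREATE(true),
    DELETE(true),
    SET_DATA(true),
    MULTI(true),
    GET_DATA(false),
    EXISTS(false);

    private final boolean needsQuorumCommit;

    OpType(boolean needsQuorumCommit) {
        this.needsQuorumCommit = needsQuorumCommit;
    }

    boolean needsQuorumCommit() {
        return needsQuorumCommit;
    }
}

class FollowerRouter {
    // No per-op switch to keep in sync: the decision lives on the enum.
    static boolean forwardToLeader(OpType op) {
        return op.needsQuorumCommit();
    }

    public static void main(String[] args) {
        for (OpType op : OpType.values()) {
            System.out.println(op + " -> " + (forwardToLeader(op) ? "leader" : "local"));
        }
    }
}
```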



[jira] [Updated] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Marshall McMullen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marshall McMullen updated ZOOKEEPER-1124:
-----------------------------------------

    Attachment: ZOOKEEPER-1124.patch

Replacing the earlier patch with a better one that includes a unit test demonstrating that the fix works. The unit test connects to a follower, submits a multiop to it, and then verifies that the multiop succeeded.

When I run this unit test WITHOUT the required fixes in FollowerRequestProcessor.java and ObserverRequestProcessor.java, I get a nice failure that correctly reproduces the failures I've seen in our integration of multi:

Testcase: testMultiToFollower took 28.451 sec
    Caused an ERROR
KeeperErrorCode = ConnectionLoss
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:886)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:876)
    at org.apache.zookeeper.test.QuorumTest.testMultiToFollower(QuorumTest.java:89)
    at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)

WITH the fixes in the included patch, the unit test passes correctly.
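
The shape of this test can be modeled without a real quorum. In this hypothetical, self-contained Java sketch, a CompletableFuture plays the role of the client's pending multi request, and a boolean stands in for whether the follower has the patched case label. It only mirrors the structure of QuorumTest.testMultiToFollower (submit to a follower, then wait for success); MultiToFollowerModel and its methods are invented names, not ZooKeeper API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of "submit a multi to a follower and wait for it".
class MultiToFollowerModel {
    // Stand-in for the leader: commits whatever it receives.
    static void leaderCommit(CompletableFuture<String> pending) {
        pending.complete("committed");
    }

    // Stand-in for the follower; 'forwardsMulti' models the patched switch.
    static CompletableFuture<String> submitMulti(boolean forwardsMulti) {
        CompletableFuture<String> pending = new CompletableFuture<>();
        if (forwardsMulti) {
            leaderCommit(pending);
        }
        // Without forwarding, nothing ever completes 'pending'.
        return pending;
    }

    // Returns "ok" if the multi completes in time, "timeout" otherwise.
    static String runTest(boolean forwardsMulti, long timeoutMillis) {
        try {
            submitMulti(forwardsMulti).get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "ok";
        } catch (TimeoutException e) {
            return "timeout"; // in the trace above, the real client reported ConnectionLoss
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "error";
        }
    }

    public static void main(String[] args) {
        System.out.println(runTest(false, 200)); // prints timeout
        System.out.println(runTest(true, 200));  // prints ok
    }
}
```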


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Marshall McMullen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065716#comment-13065716 ] 

Marshall McMullen commented on ZOOKEEPER-1124:
----------------------------------------------

Thanks for fixing the imports, Benjamin; I totally appreciate it. I was just sitting down to fix that, so I was happy to see an email from you saying you'd already taken care of it :).


[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Marshall McMullen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064805#comment-13064805 ] 

Marshall McMullen commented on ZOOKEEPER-1124:
----------------------------------------------

Thanks for the suggestion, Camille. I'll try to add a unit test this evening that verifies the behavior.

> Multiop submitted to non-leader always fails due to timeout
> -----------------------------------------------------------
>
>                 Key: ZOOKEEPER-1124
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1124
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.0
>         Environment: all
>            Reporter: Marshall McMullen
>            Priority: Critical
>             Fix For: 3.4.0
>
>         Attachments: multi-non-observer.patch

[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065695#comment-13065695 ] 

Benjamin Reed commented on ZOOKEEPER-1124:
------------------------------------------

Looks great. This is ready to go. The only thing that needs to be fixed is the import org.apache.zookeeper.*; statement. We never use wildcard imports, so it needs to be expanded into explicit imports.

> Multiop submitted to non-leader always fails due to timeout
> -----------------------------------------------------------
>
>                 Key: ZOOKEEPER-1124
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1124
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.0
>         Environment: all
>            Reporter: Marshall McMullen
>            Priority: Critical
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1124.patch

[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064831#comment-13064831 ] 

Mahadev konar commented on ZOOKEEPER-1124:
------------------------------------------

Marshall, take a look at QuorumTest to see how to start a real quorum of servers.

> Multiop submitted to non-leader always fails due to timeout
> -----------------------------------------------------------
>
>                 Key: ZOOKEEPER-1124
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1124
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.0
>         Environment: all
>            Reporter: Marshall McMullen
>            Priority: Critical
>             Fix For: 3.4.0
>
>         Attachments: multi-non-observer.patch

[jira] [Commented] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065866#comment-13065866 ] 

Hudson commented on ZOOKEEPER-1124:
-----------------------------------

Integrated in ZooKeeper-trunk #1244 (See [https://builds.apache.org/job/ZooKeeper-trunk/1244/])
    ZOOKEEPER-1124. Multiop submitted to non-leader always fails due to timeout

breed : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146961
Files : 
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/FollowerRequestProcessor.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/QuorumTest.java
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ObserverRequestProcessor.java


> Multiop submitted to non-leader always fails due to timeout
> -----------------------------------------------------------
>
>                 Key: ZOOKEEPER-1124
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1124
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.0
>         Environment: all
>            Reporter: Marshall McMullen
>            Priority: Critical
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1124.patch

[jira] [Updated] (ZOOKEEPER-1124) Multiop submitted to non-leader always fails due to timeout

Posted by "Marshall McMullen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marshall McMullen updated ZOOKEEPER-1124:
-----------------------------------------

    Attachment: multi-non-observer.patch

Patch to add OpCode.multi case statements to FollowerRequestProcessor.java and ObserverRequestProcessor.java so that multiop requests are forwarded to the leader.
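To illustrate the kind of change this patch makes, here is a minimal Java sketch of how a non-leader server might decide which requests to forward to the leader. It is a simplified model and not ZooKeeper's code. The RequestRouter class, the OpType enum and the shouldForwardToLeader method are hypothetical names, and the set of forwarded operations is reduced for brevity. The point it shows is that a switch over op codes treats any operation without an explicit case as a local read. A multi request with no case of its own therefore never reaches the leader for commit, which is consistent with the timeout described in the report.

```java
// Hypothetical, simplified model of request routing on a follower or observer.
// Names are illustrative only; this is not ZooKeeper's actual API.
public class RequestRouter {

    // Simplified stand-ins for request op codes (not ZooKeeper's OpCode values).
    enum OpType { CREATE, DELETE, SET_DATA, SET_ACL, GET_DATA, EXISTS, SYNC, MULTI }

    // Returns true if a request received by a non-leader must be sent to the
    // leader. State-changing requests (and sync) need the leader; reads are
    // answered locally.
    static boolean shouldForwardToLeader(OpType op) {
        switch (op) {
            case SYNC:
            case CREATE:
            case DELETE:
            case SET_DATA:
            case SET_ACL:
            case MULTI: // the kind of case the bug report says was missing
                return true;
            default:
                // Anything not listed above is treated as a local read.
                return false;
        }
    }

    public static void main(String[] args) {
        // Print the routing decision for every modeled operation.
        for (OpType op : OpType.values()) {
            String target = shouldForwardToLeader(op) ? "leader" : "local";
            System.out.println(op + " -> " + target);
        }
    }
}
```

Running main prints one line per operation, for example MULTI -> leader. Deleting the MULTI case makes that operation fall through to the default branch and print MULTI -> local, which models the behavior reported in this issue.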

> Multiop submitted to non-leader always fails due to timeout
> -----------------------------------------------------------
>
>                 Key: ZOOKEEPER-1124
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1124
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.0
>         Environment: all
>            Reporter: Marshall McMullen
>            Priority: Critical
>             Fix For: 3.4.0
>
>         Attachments: multi-non-observer.patch