You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Flavio Paiva Junqueira (JIRA)" <ji...@apache.org> on 2010/03/03 11:20:27 UTC

[jira] Created: (ZOOKEEPER-684) Race in LENonTerminateTest

Race in LENonTerminateTest
--------------------------

                 Key: ZOOKEEPER-684
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
             Project: Zookeeper
          Issue Type: Bug
            Reporter: Flavio Paiva Junqueira
             Fix For: 3.3.0
         Attachments: zookeeper-684-test-failure.rtf

testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Flavio Paiva Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842068#action_12842068 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-684:
--------------------------------------------------

The race we observe in the log attached should only happen if the socket times out after 200ms in LeaderElection.java. In regular runs, I wouldn't expect it to time out, but on a very slow machine it could happen. If we start observing it often, we might consider increasing the socket time out value to avoid false positives. 

For now, my recommendation is that we don't fix it.

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Flavio Paiva Junqueira
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843299#action_12843299 ] 

Henry Robinson commented on ZOOKEEPER-684:
------------------------------------------

I have a number of options here, would appreciate thoughts on the right one to go with:

1. Write a tricky proxy that intercepts messages for Thread 0 and replies with the 'right' message for this interleaving. 
2. Subclass LeaderElection.java with a class that has an almost identical copy of lookForLeader, but puts the latch code just before the setCurrentVote call so that all threads have done one round simultaneously.
3. Deactivate the test for 3.3.0

I don't like 3, and 1 is too tricky. I'm going to go with 2 for now, although it is ugly, because it is correct and hopefully relatively quick to do. 

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843292#action_12843292 ] 

Patrick Hunt commented on ZOOKEEPER-684:
----------------------------------------

I'm seeing this as well - coreduo with karmic 64bit

java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)


> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated ZOOKEEPER-684:
------------------------------------

    Hadoop Flags: [Reviewed]

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843398#action_12843398 ] 

Hudson commented on ZOOKEEPER-684:
----------------------------------

Integrated in ZooKeeper-trunk #726 (See [http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/726/])
    . Race in LENonTerminateTest


> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843387#action_12843387 ] 

Benjamin Reed commented on ZOOKEEPER-684:
-----------------------------------------

+1 ok it works on my machine now and the code looks fine except for the copy of lookForLeader in the testcase. i support the commit if we open a related jira to fix it in the next release.

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henry Robinson updated ZOOKEEPER-684:
-------------------------------------

    Attachment: ZOOKEEPER-684.patch

Rebased patch off trunk (fix for new Factory constructor)

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843232#action_12843232 ] 

Patrick Hunt commented on ZOOKEEPER-684:
----------------------------------------

Ben is saying the test fails every third time he runs the test.

Flavio is saying we need to either rewrite the test  to elim race condition or address via his previous comment (incr timeout)


> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843316#action_12843316 ] 

Patrick Hunt commented on ZOOKEEPER-684:
----------------------------------------

You might find more detail here, hudson trunk just failed with this error: 
http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/724/

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Flavio Paiva Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843300#action_12843300 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-684:
--------------------------------------------------

I agree with your observation: "Thread 1 has received 2 votes for server 2 as the leader. It then exits, and this is the problem, I think. As a result, Thread 0 can never get a quorum." And, my interpretation is that it happens because server 1 is timing out before receiving the vote of server 2 in round 1. Server 1 then receives in the second round the vote of mock server 2 and the vote of server 0 (also supporting 2), which cause server 1 to leave prematurely.

I also don't think your patch works because "peer.getElectionAlg().lookForLeader()" won't return until the election is over. That method is not called for each round.

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843303#action_12843303 ] 

Henry Robinson commented on ZOOKEEPER-684:
------------------------------------------

I'm not totally clear that thread 1 is timing out. In the trace attached to this JIRA, the first timeout is Thread-0 which is timing out when talking either to thread 1 or thread 2. In particular, Thread-1 seems to be getting three votes in its first round. Can you help me find the UDP timeout you're talking about?

(you're right, the patch doesn't work for a number of reasons :))

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-684:
-----------------------------------

      Component/s: server
                   leaderElection
         Priority: Critical  (was: Major)
    Fix Version/s:     (was: 3.4.0)
                   3.3.0
         Assignee: Henry Robinson

Sorry, Flavio is saying this is the issue, not ZOOKEEPER-675

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843376#action_12843376 ] 

Patrick Hunt commented on ZOOKEEPER-684:
----------------------------------------

this addressed the problem on my system.

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henry Robinson updated ZOOKEEPER-684:
-------------------------------------

    Attachment: ZOOKEEPER-684.patch

This patch includes a fairly hacky mocking out of LeaderElection to fix the observed race, and another improvement to fix another race that cropped up (mockServer not starting in time, leading to a potential timeout like Flavio suggested).

I could reproduce a failure before this patch was finished, and can't now, but please test yourselves. 

In order to mock QuorumPeer and LeaderElection I made a number of member variables protected rather than private, and put 'running' behind a getter / setter interface. 

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843295#action_12843295 ] 

Henry Robinson commented on ZOOKEEPER-684:
------------------------------------------

Flavio - 

Looking at this excerpt from round one: 

2010-03-03 04:56:54,630 - INFO  [Thread-0:LeaderElection@155] - Server address: 0.0.0.0/0.0.0.0:11223
2010-03-03 04:56:54,631 - INFO  [Thread-0:LeaderElection@155] - Server address: 0.0.0.0/0.0.0.0:11225
2010-03-03 04:56:54,630 - INFO  [Thread-1:LeaderElection@155] - Server address: 0.0.0.0/0.0.0.0:11221
2010-03-03 04:56:54,632 - INFO  [Thread-0:LeaderElection@103] - Election tally: 
2010-03-03 04:56:54,632 - INFO  [Thread-0:LeaderElection@109] - 0	-> 1
2010-03-03 04:56:54,632 - INFO  [Thread-0:LeaderElection@109] - 2	-> 1
2010-03-03 04:56:54,632 - INFO  [Thread-0:LeaderElection@109] - 1	-> 1
2010-03-03 04:56:54,633 - INFO  [Thread-1:LeaderElection@155] - Server address: 0.0.0.0/0.0.0.0:11223
2010-03-03 04:56:54,633 - INFO  [Thread-1:LeaderElection@155] - Server address: 0.0.0.0/0.0.0.0:11225
2010-03-03 04:56:54,633 - INFO  [Thread-1:LeaderElection@103] - Election tally: 
2010-03-03 04:56:54,634 - INFO  [Thread-1:LeaderElection@109] - 2	-> 2
2010-03-03 04:56:54,634 - INFO  [Thread-1:LeaderElection@109] - 1	-> 1
2010-03-03 04:56:54,634 - INFO  [Thread-1:LeaderElection@218] - Found leader: my type is: PARTICIPANT

Thread 1 has received 2 votes for server 2 as the leader. It then exits, and this is the problem, I think. As a result, Thread 0 can never get a quorum. 

Therefore my thought was to put a barrier after all servers have done one round, but not updated their results. Unfortunately, it looks like line 212 in LeaderElection.java upsets this plan as the thread's current vote is set here. 

So Thread 0 gets one vote for every server. It sets its vote to 2 and then replies to Thread 1 which therefore sees two votes for server 2. What we want to happen is Thread 0 gets one vote for every server, but replies to Thread 1 before it does so. I'm unclear on an easy way to mock up this interleaving - the LeaderElection code is hard to stub out. 

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843286#action_12843286 ] 

Benjamin Reed commented on ZOOKEEPER-684:
-----------------------------------------

btw, i'm getting the failure on a coreduo laptop running ubuntu.

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Flavio Paiva Junqueira (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843279#action_12843279 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-684:
--------------------------------------------------

Hi Henry, if I understand correctly your patch, in the case we have an execution like the one of the log attached in this issue, then the test would fail because the latch would never count down to zero. Is this correct? If so, I don't understand how it improves the test.

My understanding of the race is that in the first round, server 0 receives a vote from mock server 2, but server 1 does not receive a vote from 2 (udp socket times out while waiting to receive). In the second round, server 1 receives vote from 0 and from 2, both voting for 2, and consequently server 1 elects 2. I think this is what you observe too in your last comment. If the receive call is timing out too soon, don't we have to increase the time out value? I understand that this is not desirable because it increases election time, but if it the current value is not sufficient, then I don't see a better option.



> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henry Robinson updated ZOOKEEPER-684:
-------------------------------------

    Attachment: ZOOKEEPER-684.patch

Here's a patch - Ben, can you try it and see if this does anything for the failures you're seeing? 

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-684:
-----------------------------------

    Fix Version/s:     (was: 3.3.0)
                   3.4.0

Pushing to 3.4.0 per Flavio's comment.

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Flavio Paiva Junqueira
>             Fix For: 3.4.0
>
>         Attachments: zookeeper-684-test-failure.rtf
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed resolved ZOOKEEPER-684.
-------------------------------------

    Resolution: Fixed

Committed revision 921201.

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Flavio Paiva Junqueira (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Paiva Junqueira updated ZOOKEEPER-684:
---------------------------------------------

    Attachment: zookeeper-684-test-failure.rtf

Uploading the output of the test from the Hudson run of ZOOKEEPER-59. 

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Flavio Paiva Junqueira
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843285#action_12843285 ] 

Benjamin Reed commented on ZOOKEEPER-684:
-----------------------------------------

it failed on my machine with the patch. it looks identical:

2010-03-09 12:55:28,703 - INFO  [Thread-1:LeaderElection@109] - 0       -> 1
2010-03-09 12:55:28,704 - INFO  [Thread-1:LeaderElection@109] - 2       -> 1
2010-03-09 12:55:28,704 - INFO  [Thread-1:LeaderElection@109] - 1       -> 1
2010-03-09 12:55:28,704 - INFO  [Thread-2:LeaderElection@155] - Server address: /127.0.0.1:11221
2010-03-09 12:55:28,705 - INFO  [Thread-2:LeaderElection@155] - Server address: /127.0.0.1:11223
2010-03-09 12:55:28,706 - INFO  [Thread-2:LeaderElection@155] - Server address: /127.0.0.1:11225
2010-03-09 12:55:28,706 - INFO  [Thread-2:LeaderElection@103] - Election tally:
2010-03-09 12:55:28,707 - INFO  [Thread-2:LeaderElection@109] - 2       -> 2
2010-03-09 12:55:28,707 - INFO  [Thread-2:LeaderElection@109] - 1       -> 1
2010-03-09 12:55:28,708 - INFO  [Thread-2:LeaderElection@218] - Found leader: my type is: PARTICIPANT
2010-03-09 12:55:29,705 - INFO  [Thread-1:LeaderElection@155] - Server address: /127.0.0.1:11221
2010-03-09 12:55:29,706 - INFO  [Thread-1:LeaderElection@155] - Server address: /127.0.0.1:11223
2010-03-09 12:55:29,710 - INFO  [Thread-1:LeaderElection@155] - Server address: /127.0.0.1:11225
2010-03-09 12:55:29,911 - WARN  [Thread-1:LeaderElection@195] - Ignoring exception while looking for leader
java.net.SocketTimeoutException: Receive timed out
        at java.net.PlainDatagramSocketImpl.receive0(Native Method)
        at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
        at java.net.DatagramSocket.receive(DatagramSocket.java:712)
        at org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:171)
        at org.apache.zookeeper.test.LENonTerminateTest$LEThread.run(LENonTerminateTest.java:87)
2010-03-09 12:55:29,912 - INFO  [Thread-1:LeaderElection@103] - Election tally:
2010-03-09 12:55:30,913 - INFO  [Thread-1:LeaderElection@155] - Server address: /127.0.0.1:11221
2010-03-09 12:55:30,914 - INFO  [Thread-1:LeaderElection@155] - Server address: /127.0.0.1:11223
2010-03-09 12:55:30,915 - INFO  [Thread-1:LeaderElection@155] - Server address: /127.0.0.1:11225
2010-03-09 12:55:31,116 - WARN  [Thread-1:LeaderElection@195] - Ignoring exception while looking for leader
java.net.SocketTimeoutException: Receive timed out
        at java.net.PlainDatagramSocketImpl.receive0(Native Method)
        at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
        at java.net.DatagramSocket.receive(DatagramSocket.java:712)
        at org.apache.zookeeper.server.quorum.LeaderElection.lookForLeader(LeaderElection.java:171)
        at org.apache.zookeeper.test.LENonTerminateTest$LEThread.run(LENonTerminateTest.java:87)
2010-03-09 12:55:31,117 - INFO  [Thread-1:LeaderElection@103] - Election tally:
2010-03-09 12:55:31,118 - INFO  [Thread-1:LeaderElection@109] - 0       -> 1
2010-03-09 12:55:32,119 - INFO  [Thread-1:LeaderElection@155] - Server address: /127.0.0.1:11221

the last part keeps repeating.

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf, ZOOKEEPER-684.patch
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (ZOOKEEPER-684) Race in LENonTerminateTest

Posted by "Henry Robinson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/ZOOKEEPER-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843241#action_12843241 ] 

Henry Robinson commented on ZOOKEEPER-684:
------------------------------------------

According to the logs, it looks like the problem is that thread 1 is getting two votes for server 2. This happens because thread 0 votes for 2 rather than itself as it has already received a full complement of votes.

The race is therefore that thread 0 collects all its votes and updates its own vote before thread 1 is responded to. 

To fix, I propose adding a CountdownLatch between lookForLeader and setCurrentVote to ensure that both thread 0 and thread 1 manage one round of voting before updating their choices. I can't recreate the error here, so I'll offer a patch for Ben to hopefully try.  

> Race in LENonTerminateTest
> --------------------------
>
>                 Key: ZOOKEEPER-684
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-684
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>            Reporter: Flavio Paiva Junqueira
>            Assignee: Henry Robinson
>            Priority: Critical
>             Fix For: 3.3.0
>
>         Attachments: zookeeper-684-test-failure.rtf
>
>
> testNonTermination failed during a Hudson run for ZOOKEEPER-59. After inspecting the output, it looks like server is electing 2 as a leader and leaving. Given that 2 is just a mock server, server 0 remains alone in leader election.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.