Posted to dev@zookeeper.apache.org by Anthony Urso <an...@gmail.com> on 2008/08/19 03:49:17 UTC

Fast leader election algorithm throws NPE and hangs

I updated trunk to current to get the diff for ZOOKEEPER-122, and I
stopped being able to run my dev zookeeper cluster in distributed
mode.  In order to get it running again, I had to specify the election
algorithm to be 0.
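
For reference, the workaround described above would presumably be a single extra line in zoo.cfg. This is only a sketch: it assumes the option is spelled electionAlg, as in the released ZooKeeper 3.x administrator documentation, where 0 selects the original UDP-based leader election. The thread itself does not show the exact line that was used.

```
# Assumed key name (electionAlg); 0 = original leader election,
# used here instead of the fast leader election that hangs.
electionAlg=0
```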

One of the servers gets this NPE:

Exception in thread "Thread-2" java.lang.NullPointerException
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:518)

The rest just hang while running an election:

zoo.log:
2008-08-18 18:31:26,519 - INFO  [QuorumPeer:QuorumPeer@370] - LOOKING
2008-08-18 18:31:26,537 - WARN  [QuorumPeer:FastLeaderElection@397] - Election tally: 0

command line:
java -cp /home/anthonyu/lib/zookeeper/trunk/zookeeper-3.0.0.jar:/home/anthonyu/lib/log4j-1.2.15.jar:/home/anthonyu/lib/zookeeper/trunk/conf
org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg &

original zoo.cfg:
tickTime=2000
dataDir=/home/anthonyu/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2182
server.2=zoo2:2182
server.3=zoo3:2182

I don't know if this is a bug or a misconfiguration.

Cheers,
Anthony

RE: Fast leader election algorithm throws NPE and hangs

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
I have also tried with your configuration, changing only the dataDir
parameter and the server names, and everything seems to work fine. I have
also killed followers and leaders multiple times and in different orders to
see whether any issue shows up, and I still can't see anything wrong.

In your case, did you observe the problem just once, or did it happen
consistently every time you tried?

-Flavio


Re: Fast leader election algorithm throws NPE and hangs

Posted by Anthony Urso <an...@gmail.com>.
On Tue, Aug 19, 2008 at 3:24 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
> Anthony, Could you tell me how you're starting up the servers? Everything
> works fine in my setting, so I can't reproduce it. I'm starting up one
> server at a time, and my config is very similar to yours:

I run this command on all three servers at the same time with cssh:

java -cp /home/anthonyu/lib/zookeeper/trunk/zookeeper-3.0.0.jar:/home/anthonyu/lib/log4j-1.2.15.jar:/home/anthonyu/lib/zookeeper/trunk/conf
org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg &

>
> clientPort=2181
> quorumPort=1111
> electionPort=1112
> tickTime=2000
> initLimit=5
> syncLimit=5
> dataDir=/tmp/zookeeper
> server.1=xxx1:11111
> server.2=xxx2:11111
> server.3=xxx3:11111


I do not have the quorumPort and electionPort attributes in my config
file. Are there defaults?  I will try them the next time I need to bring
the cluster down.

>
> In any case, there seems to be a race condition in QuorumCnxManager, which
> I'll investigate.
>
> Thanks,
> -Flavio

RE: Fast leader election algorithm throws NPE and hangs

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Anthony, could you tell me how you're starting up the servers? Everything
works fine in my setup, so I can't reproduce it. I'm starting up one
server at a time, and my config is very similar to yours:

clientPort=2181
quorumPort=1111
electionPort=1112
tickTime=2000
initLimit=5
syncLimit=5
dataDir=/tmp/zookeeper
server.1=xxx1:11111
server.2=xxx2:11111
server.3=xxx3:11111

In any case, there seems to be a race condition in QuorumCnxManager, which
I'll investigate.

Thanks,
-Flavio
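
The thread does not show the code at QuorumCnxManager.java:518, so the exact cause of the NPE is not known here. As a minimal sketch of the kind of race that could produce it, assume a worker thread looks up a per-peer send queue in a shared map while another thread may remove that entry. Every name below (SendQueueRaceSketch, register, unregister, pollUnsafe, pollSafe) is hypothetical and is not ZooKeeper code; the removal is done inline so the outcome is deterministic.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for a per-peer send-queue table; not ZooKeeper code. */
public class SendQueueRaceSketch {
    // Shared between a "connection manager" thread and its send workers.
    private final ConcurrentMap<Long, BlockingQueue<String>> queues =
            new ConcurrentHashMap<Long, BlockingQueue<String>>();

    public void register(long sid) {
        queues.putIfAbsent(sid, new ArrayBlockingQueue<String>(16));
    }

    public void unregister(long sid) {
        queues.remove(sid);
    }

    public boolean enqueue(long sid, String msg) {
        BlockingQueue<String> q = queues.get(sid);
        return q != null && q.offer(msg);
    }

    /** Racy: assumes the entry still exists, so get() may return null. */
    public String pollUnsafe(long sid) {
        return queues.get(sid).poll();
    }

    /** Reads the map once into a local and checks it before use. */
    public String pollSafe(long sid) {
        BlockingQueue<String> q = queues.get(sid);
        return q == null ? null : q.poll();
    }

    public static void main(String[] args) {
        SendQueueRaceSketch table = new SendQueueRaceSketch();
        table.register(1L);
        table.enqueue(1L, "vote");

        // Simulate another thread tearing the connection down just before
        // the worker reads its queue.
        table.unregister(1L);

        System.out.println("safe poll: " + table.pollSafe(1L));
        try {
            table.pollUnsafe(1L);
        } catch (NullPointerException e) {
            System.out.println("unsafe poll threw NullPointerException");
        }
    }
}
```

With the entry removed first, the guarded lookup returns null and the unguarded one throws. Whether the real bug has this shape is an open question that the thread leaves for investigation.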

