Posted to user@zookeeper.apache.org by Todd Greenwood <to...@audiencescience.com> on 2009/07/31 02:08:59 UTC

test failures in branch-3.2

The build succeeds, but not all of the tests pass. In previous test runs,
I noticed an error in org.apache.zookeeper.test.FLETest. It apparently was
not able to bind to a port. Now, after a machine reboot, I'm getting
different failures.

branch-3.2 $ ant test

[junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
FAILED (crashed)
[junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED

Test logs for these two tests attached.

My goal here is to get to a known state (all tests succeeding or have
workarounds for the failures). Following that, I plan to apply the
patches Flavio recommended for a WAN deploy (479 and 481). After I
verify that the tests continue to run, I'll package this up and deploy
it to our WAN for testing. 

So, are these known issues? Do the tests normally run en masse, or do
some of the tests hold on to resources and prevent other tests from
passing?

-Todd

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Well, try running these two tests individually and see if they always
fail or just occasionally. That will be a good start (along with the env detail).

Patrick
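
A rough way to tell "always fails" from "occasionally fails" is to repeat a single test run and count the failures. The sketch below is only an illustration: the helper name count_failures is made up, and it assumes the command's exit code is non-zero when the test fails, which should be checked against the actual ant build before relying on it.

```python
import subprocess

def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # Output is discarded here; redirect it to a file instead if the
        # logs of the failing runs are needed.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

For example, count_failures(["ant", "-Dtestcase=QuorumPeerMainTest", "test-core-java"], 5), modeled on the -Dtestcase invocation quoted in this thread, would report how many of five runs failed.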

Todd Greenwood wrote:
> No edits to conf/log4j.properties.

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
No edits to conf/log4j.properties.

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
btw QuorumPeerMainTest uses the CONSOLE appender, which is set up in 
conf/log4j.properties. Now that I think of it, perhaps not such a good 
idea :-)

If you edited conf/log4j.properties it may be causing the test to fail. 
Did you do this? (If you run the test by itself using -Dtestcase, does it 
always fail?)

I've entered a jira to address this:
https://issues.apache.org/jira/browse/ZOOKEEPER-492

Patrick

Patrick Hunt wrote:
> Todd Greenwood wrote:
>> The build succeeds, but not all of the tests pass. In previous test runs,
>> I noticed an error in org.apache.zookeeper.test.FLETest. It was not able
>> to bind to a port or something. Now, after a machine reboot, I'm getting
>> different failures. 
> 
> "address in use"? That's a problem in the test framework pre-3.3. In 3.3 
> (current svn trunk) I fixed it but it's not in 3.2.x. This is a problem 
> with the test framework though and not a real problem, it shows up 
> occasionally (depends on timing).
> 
>> branch-3.2 $ ant test
>>
>> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>> FAILED (crashed)
>> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
>>
>> Test logs for these two tests attached.
> 
> This is unusual though - looking at the log it seems that the JVM itself 
> crashed for the QPMainTest! For HQT we are seeing:
> 
> junit.framework.AssertionFailedError: Threads didn't join
> 
> which Flavio once mentioned to me can happen but is not a real 
> problem (he can elaborate).
> 
> What version of java are you using? OS, other environment that might be 
> interesting? (vm? etc...) You might try looking at the jvm crash dump 
> file (I think it's in /tmp)
> 
> If you run each of these two tests individually do they run? example:
> ant -Dtestcase=FLENewEpochTest test-core-java
> 
>> My goal here is to get to a known state (all tests succeeding or have
>> workarounds for the failures). Following that, I plan to apply the
>> patches Flavio recommended for a WAN deploy (479 and 481). After I
>> verify that the tests continue to run, I'll package this up and deploy
>> it to our WAN for testing. 
> 
> Sounds like a good plan.
> 
>> So, are these known issues? Do the tests normally run en masse, or do
>> some of the tests hold on to resources and prevent other tests from
>> passing?
> 
> Typically they do run to completion, but occasionally on my machine 
> (java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
> random failure due to address in use, or the same "didn't join" that you 
> saw. Usually I see this if I'm multitasking (vs just letting the tests 
> run w/o using the box). As I said this is addressed in 3.3 (address 
> reuse at the very least, and I haven't seen the other issues).
> 
> Patrick
> 
> 
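
The "address in use" failures mentioned above come down to a test trying to bind a port that something else still holds. A minimal sketch of a pre-flight check follows. The function name port_is_free is hypothetical, and the thread does not say which ports the 3.2 test framework uses, so the port is left as a parameter.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket still owns the port.
            return False
        return True
```

Note that this only reports the state at the moment of the call. A port that is free now can still be taken before the test binds it, which fits the timing-dependent behaviour described in the quoted message.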

RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
IT says that yes, there are firewalls, but there is full
connectivity between each of the zk servers.
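
One way to double-check the connectivity claim is to attempt a TCP connection from each server to every other server on the quorum and election ports (2888 and 3888 in the config quoted below). This is a sketch only: the function names are hypothetical, and it has to be run on each machine in turn because it only tests outbound connections from the host it runs on.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False

def unreachable_peers(hosts, ports=(2888, 3888), timeout=3.0):
    """List the (host, port) pairs that could not be reached from this machine."""
    return [(h, p) for h in hosts for p in ports if not can_connect(h, p, timeout)]
```

A refused connection does not by itself prove a firewall problem. It can also mean that the peer's ZooKeeper process is not listening on that port at the time of the check.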

> -----Original Message-----
> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> Sent: Tuesday, August 04, 2009 6:01 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Hi todd,
>   I see a lot of
> 
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.Net.connect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>         at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
>         at java.lang.Thread.run(Thread.java:619)
> 
> 
> Is it possible that there is some firewall? Can all the servers 1-9
> connect to all the others using the ports that you specified in zoo.cfg,
> i.e. 2888/3888?
> 
> 
> Thanks
> mahadev
> 
> 
> On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
> > Looks like we're not getting *any* leader elected now.... Logs attached.
> >
> >> -----Original Message-----
> >> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >> Sent: Tuesday, August 04, 2009 4:07 PM
> >> To: zookeeper-dev@hadoop.apache.org
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Patrick, thanks! I'll forward on to IT and I'll report back to you
> >> shortly...
> >>
> >>> -----Original Message-----
> >>> From: Patrick Hunt [mailto:phunt@apache.org]
> >>> Sent: Tuesday, August 04, 2009 3:55 PM
> >>> To: zookeeper-dev@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Todd, Mahadev and I looked at this and it turns out to be a regression.
> >>> Ironically a patch I created for 3.2 branch to add quorum tests actually
> >>> broke the quorum config -- a default value for a config parameter was
> >>> lost. I'm going to submit a patch asap to get the default back, but for
> >>> the time being you can set:
> >>>
> >>> electionAlg=3
> >>>
> >>> in each of your config files.
> >>>
> >>> You should see reference to FastLeaderElection in your log files if this
> >>> parameter is set correctly.
> >>>
> >>> Sorry for the trouble,
> >>>
> >>> Patrick
> >>>
> >>> Todd Greenwood wrote:
> >>>> Mahadev,
> >>>>
> >>>> I just heard from IT that this build behaves in exactly the same way as
> >>>> previous versions, e.g. we get continuous leader elections that
> >>>> disconnect the followers and then get re-elected, and disconnect...etc.
> >>>>
> >>>> This is from a fresh sync to the 3.2 branch:
> >>>>
> >>>> svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
> >>>>
> >>>> CHANGES.TXT show the various fixes included:
> >>>>
> >>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> >>>> Release 3.2.1
> >>>>
> >>>> Backward compatibile changes:
> >>>>
> >>>> BUGFIXES:
> >>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info().
> >>>>   (chris via flavio)
> >>>>
> >>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten
> >>>>   (chris via mahadev)
> >>>>
> >>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> >>>>
> >>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests
> >>>>   (chris via mahadev)
> >>>>
> >>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
> >>>>   (giri via mahadev)
> >>>>
> >>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> >>>>
> >>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
> >>>>   failure. (chris via mahadev)
> >>>>
> >>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers
> >>>>   (mahadev via phunt)
> >>>>
> >>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
> >>>>   embedded clients (ryan rawson via phunt)
> >>>>
> >>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager.
> >>>>   (flavio via mahadev)
> >>>>
> >>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
> >>>>   (flavio via mahadev)
> >>>>
> >>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
> >>>>   (Chris Darroch via phunt)
> >>>>
> >>>>   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
> >>>>   add vote of follower (flavio via mahadev)
> >>>>
> >>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
> >>>>   (flavio via mahadev)
> >>>>
> >>>> What can I do to assist you with this issue?
> >>>>
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>> Sent: Tuesday, August 04, 2009 12:43 PM
> >>>>> To: zookeeper-dev@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Hi todd,
> >>>>>  comments in line
> >>>>>
> >>>>>
> >>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
> >>>> wrote:
> >>>>>> Mahadev,
> >>>>>>
> >>>>>> Some quick questions:
> >>>>>>
> >>>>>> 1. Version
> >>>>>>
> >>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> >>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
> >>>>>> this release 3.2.1?
> >>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
> >>>>> the release.
> >>>>>
> >>>>>> 2. Build targets
> >>>>>>
> >>>>>> The package target fails b/c the create-cppunit-configure target fails
> >>>>>> due to various problems w/ respect to autoconf. Are these dependencies
> >>>>>> documented somewhere ? I'd like to have a fully building system.
> >>>>>>
> >>>>>> create-cppunit-configure:
> >>>>>>      [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
> >>>>>>      [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
> >>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
> >>>>>>      [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
> >>>>>>      [exec]       If this token and others are legitimate, please use m4_pattern_allow.
> >>>>>>      [exec]       See the Autoconf documentation.
> >>>>>>      [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
> >>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >>>>>>
> >>>>> You need auto tools to run this. Please read the README for building c
> >>>>> client library at src/c/ for the installation requirements.
> >>>>>> 3. Sync failure:
> >>>>>>
> >>>>>> This is still failing.
> >>>>>>
> >>>>>> svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
> >>>>>>
> >>>>> Yes this hasn't been fixed yet!
> >>>>>
> >>>>> Thanks
> >>>>> mahadev
> >>>>>> -Todd
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Todd Greenwood
> >>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
> >>>>>>> To: 'zookeeper-user@hadoop.apache.org'
> >>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Great news. Thank you Mahadev. I'll report our findings later
> >>>> today.
> >>>>>>> -Todd
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> Hi Todd,
> >>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch now.
> >>>>>>>> Thanks
> >>>>>>>> mahadev
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
> > <to...@audiencescience.com>
> >>>>>> wrote:
> >>>>>>>>> That'd be perfect. Thanks!
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> Hi Todd,
> >>>>>>>>>>   Most of the patches that you mention should be in the branch 3.2 by
> >>>>>>>>>> tomorrow or so. 481, 479 are already in. 480 and 491 should be in by
> >>>>>>>>>> tomorrow. Would that suffice for you?
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> mahadev
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
> >> <to...@audiencescience.com>
> >>>>>>> wrote:
> >>>>>>>>>>> Another problem...I've reverted to the latest versions of the patches
> >>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two compilation
> >>>>>>>>>>> errors:
> >>>>>>>>>>>
> >>>>>>>>>>> build-generated:
> >>>>>>>>>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>
> >>>>>>>>>>> compile-main:
> >>>>>>>>>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
> >>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
> >>>>>>>>>>>     [javac]                         ^
> >>>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
> >>>>>>>>>>>     [javac]         public String getServerState();
> >>>>>>>>>>>     [javac]                       ^
> >>>>>>>>>>>     [javac] 2 errors
> >>>>>>>>>>>
> >>>>>>>>>>> My build process is pretty simple:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>>>>>>> (src/patched/branch-3.2)
> >>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>>>>>>> 3. build zookeeper in the temp directory
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Flavio,
> >>>>>>>>>>>> I notice that you've updated the patches referenced for the WAN
> >>>>>>>>>>>> deployment. There appears to be an order dependency w/ respect to
> >>>>>>>>>>>> these four patches...
> >>>>>>>>>>>>
> >>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>>>>>>>>>>>
> >>>>>>>>>>>> 473 -> 479 (479 fails)
> >>>>>>>>>>>>
> >>>>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> >>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> >>>>>>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>>>>>>> Hunk #1 FAILED at 93.
> >>>>>>>>>>>> Hunk #2 FAILED at 145.
> >>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> >>>>>>>>>>>>
> >>>>>>>>>>>> Could you advise as to which patches I need to apply, and in what
> >>>>>>>>>>>> order?
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
> >>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches
> >>>>>>>>>>>> 473, 479, 481, and 491.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of the
> >>>>>>>>>>>> patch.
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Flavio,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>>
> >>>>>>>>>>>> compile-main:
> >>>>>>>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
> >>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>   [javac] ^
> >>>>>>>>>>>>   [javac] 1 error
> >>>>>>>>>>>>
> >>>>>>>>>>>> I see a reference to getWeight in both FastLeaderElection.java in
> >>>>>>>>>>>> patch 491:
> >>>>>>>>>>>>
> >>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>
> >>>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479,
> >>>>>>>>>>>> or 481. I also don't see a reference to this method in the trunk...
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Todd Greenwood
> > [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> >>>>>>>>>>>> 481).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
> >>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be
> >>>>>>>>>>>> and then disconnects everyone. Then they re-elect it again, and it
> >>>>>>>>>>>> loops over and over.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>> Server config
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>>
> >>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
> >>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> >>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines
> >>>>>>>>>>>> in the pods all have a weight of zero, and are not expected to become
> >>>>>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>
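
The intent of the server config quoted above (only the weighted dc1 machines vote or lead) can be sanity-checked by listing the servers whose weight is non-zero. The sketch below is an illustration and not ZooKeeper's own parser. The function name voting_servers is hypothetical, a server with no weight.N line is assumed to have weight 1, and the group lines are ignored because they are not needed for this check.

```python
def voting_servers(config_text):
    """Return the sorted server ids whose weight is non-zero.

    Assumption: a server without an explicit weight.N line has weight 1.
    """
    servers, weights = set(), {}
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("server."):
            servers.add(int(key[len("server."):]))
        elif key.startswith("weight."):
            weights[int(key[len("weight."):])] = int(value)
    return sorted(sid for sid in servers if weights.get(sid, 1) != 0)
```

Run against the nine-server config from the thread, this should return servers 1 through 5. As a side note, the ZooKeeper administrator's guide documents the group syntax as group.x=nnnnn[:nnnnn], so the "group.1:1:2:3:4:5" form shown in the quoted config may be worth double-checking against the real zoo.cfg.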


Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
 Can you attach the files to the jira? I will take a look at this and will
get back to you by end of day today.

Thanks
mahadev
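
Before attaching logs, it may help to confirm that the workaround Patrick describes in the quoted message below has actually taken effect on every server: electionAlg=3 in the config, and a reference to FastLeaderElection in the log. A small sketch follows. Both function names are hypothetical, and the caller reads the config and log text from wherever they live on each server.

```python
def election_alg(config_text):
    """Return the electionAlg value from zoo.cfg-style text, or None if unset."""
    value = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, val = (part.strip() for part in line.split("=", 1))
        if key == "electionAlg":
            value = int(val)  # if the key is repeated, the last one wins here
    return value

def mentions_fast_leader_election(log_lines):
    """Return True if any log line references FastLeaderElection."""
    return any("FastLeaderElection" in line for line in log_lines)
```

A server where election_alg(...) returns 3 but the log never mentions FastLeaderElection would suggest that the running process was started from a different config file or has not been restarted.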


On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Looks like we're not getting *any* leader elected now.... Logs attached.
> 
>> -----Original Message-----
>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>> Sent: Tuesday, August 04, 2009 4:07 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>> 
>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>> shortly...
>> 
>>> -----Original Message-----
>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>> 
>>> Todd, Mahadev and I looked at this and it turns out to be a regression.
>>> Ironically a patch I created for 3.2 branch to add quorum tests actually
>>> broke the quorum config -- a default value for a config parameter was
>>> lost. I'm going to submit a patch asap to get the default back, but for
>>> the time being you can set:
>>> 
>>> electionAlg=3
>>> 
>>> in each of your config files.
>>> 
>>> You should see reference to FastLeaderElection in your log files if this
>>> parameter is set correctly.
>>> 
>>> Sorry for the trouble,
>>> 
>>> Patrick
>>> 
>>> Todd Greenwood wrote:
>>>> Mahadev,
>>>> 
>>>> I just heard from IT that this build behaves in exactly the same way as
>>>> previous versions, e.g. we get continuous leader elections that
>>>> disconnect the followers and then get re-elected, and disconnect...etc.
>>>> 
>>>> This is from a fresh sync to the 3.2 branch:
>>>> 
>>>> svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
>>>> 
>>>> CHANGES.TXT show the various fixes included:
>>>> 
>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>> Release 3.2.1
>>>> 
>>>> Backward compatibile changes:
>>>> 
>>>> BUGFIXES:
>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
>> via
>>>> flavio)
>>>> 
>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
>> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
> mahadev)
>>>> 
>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>   (giri via mahadev)
>>>> 
>>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
>> mahadev)
>>>> 
>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
>> immediate
>>>>   failure. (chris via mahadev)
>>>> 
>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
>> via
>>>> phunt)
>>>> 
>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>> other)
>>>>   embedded clients (ryan rawson via phunt)
>>>> 
>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
>> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups
> correctly
>>>>   (flavio via mahadev)
>>>> 
>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
>> empty
>>>> cert
>>>>   (Chris Darroch via phunt)
>>>> 
>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>> leading and
>>>>   add vote of follower (flavio via mahadev)
>>>> 
>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
>> (flavio
>>>> via
>>>>   mahadev)
>>>> 
>>>> What can I do to assist you with this issue?
>>>> 
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Hi todd,
>>>>>  comments in line
>>>>> 
>>>>> 
>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>> Mahadev,
>>>>>> 
>>>>>> Some quick questions:
>>>>>> 
>>>>>> 1. Version
>>>>>> 
>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
> is
>>>> still
>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in
>>>> calling
>>>>>> this release 3.2.1?
>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as
> we
>>>> tag
>>>>> the
>>>>> release.
>>>>> 
>>>>>> 2. Build targets
>>>>>> 
>>>>>> The package target fails b/c the create-cppunit-configure target
>>>> fails
>>>>>> due to various problems w/ respect to autoconf. Are these
>>>> dependencies
>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>> 
>>>>>> create-cppunit-configure:
>>>>>>      [exec] Can't exec "libtoolize": No such file or directory
> at
>>>>>> /usr/bin/autoreconf line 188.
>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern
>> match
>>>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
> not
>>>> found
>>>>>> in library
>>>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>>>> AM_PATH_CPPUNIT
>>>>>>      [exec]       If this token and others are legitimate,
> please
>>>> use
>>>>>> m4_pattern_allow.
>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>>>> AC_PROG_LIBTOOL
>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
> status:
>> 1
>>>>>> 
>>>>> You need auto tools to run this. Please read the README for
>> building c
>>>>> client library at src/c/ for the installation requirements.
>>>>>> 3. Sync failure:
>>>>>> 
>>>>>> This is still failing.
>>>>>> 
>>>>>> svn: URL
>>>>>> 
> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>>>> doesn't exist
>>>>>> 
>>>>> Yes this hasn't been fixed yet!
>>>>> 
>>>>> Thanks
>>>>> mahadev
>>>>>> -Todd
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Todd Greenwood
>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Great news. Thank you Mahadev. I'll report our findings later
>>>> today.
>>>>>>> -Todd
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>> 
>>>>>>>> Hi Todd,
>>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
>>>> now.
>>>>>>>> Thanks
>>>>>>>> mahadev
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
> <to...@audiencescience.com>
>>>>>> wrote:
>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> Hi Todd,
>>>>>>>>>>   Most of the patches that you mention should be in the
> branch
>>>>>> 3.2 by
>>>>>>>>> tomm
>>>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>>>> tomm.
>>>>>>>>> Would
>>>>>>>>>> that
>>>>>>>>>> suffice for you?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> mahadev
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
>> <to...@audiencescience.com>
>>>>>>> wrote:
>>>>>>>>>>> Another problem...I've reverted to the latest versions of
> the
>>>>>>>>> patches
>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>>>> compilation
>>>>>>>>>>> errors:
>>>>>>>>>>> 
>>>>>>>>>>> build-generated:
>>>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>> 
>>>>>>>>>>> compile-main:
>>>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>     [javac]
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-
>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>>>> getQuorumPeers()
>>>>>>>>> have
>>>>>>>>>>> the same erasure
>>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>>>     [javac]                         ^
>>>>>>>>>>>     [javac]
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-
>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>>>> getServerState()
>>>>>>>>> have
>>>>>>>>>>> the same erasure
>>>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>>>     [javac]                       ^
>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>> 
>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>> 
>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> I notice that you've updated the patches referenced for
> the
>>>> WAN
>>>>>>>>>>>> deployment. There appears to be an order dependency w/
>> respect
>>>>>> to
>>>>>>>>>>> these
>>>>>>>>>>>> four patches...
>>>>>>>>>>>> 
>>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>>> 
>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>>>> ical.java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>>>> .java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>>>> 
>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>> 
>>>>>>>>>>>> Could you advise as to which patches I need to apply, and
> in
>>>>>> what
>>>>>>>>>>> order?
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>>>> Compilation
>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
> latest
>>>>>> patches
>>>>>>>>>>>> 473,
>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
> of
>>>> the
>>>>>>>>>>> patch.
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>> 
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> 
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> 
>>>>>> 
>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastL
>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>> 
>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>   [javac]
>>>>>> ^
>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>> 
>>>>>>>>>>>> I see a reference to getWeight in both
>>>>>> FastLeaderElection.java
>>>>>>>>>>> in
>>>>>>>>>>>> patch
>>>>>>>>>>>> 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>> :
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>> 0)
>>>>>>>>>>>> 
>>>>>>>>>>>> However, I don't see a reference to this method in
>> patches
>>>>>> 473,
>>>>>>>>>>>> 479,
>>>>>>>>>>>> or
>>>>>>>>>>>> 481. I also don't see a reference to this method in
> the
>>>>>>>>> trunk...
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood
> [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This repro's in both branch-3.2, and
>>>>>> branch-3.2+patches(473,
>>>>>>>>>>>> 479,
>>>>>>>>>>>> 481).
>>>>>>>>>>>> 
>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>>>>> pd4-zook02
>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> the
>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>>>>> supposed
>>>>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> and
>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
> again,
>>>>>> and
>>>>>>>>>>> it
>>>>>>>>>>>> loops
>>>>>>>>>>>> over and over.
>>>>>>>>>>>> 
>>>>>>>>>>>> -------------
>>>>>>>>>>>> Server config
>>>>>>>>>>>> -------------
>>>>>>>>>>>> 
>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>> 
>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>> 
>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>>>>> different
>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>>>>> machines
>>>>>>>>>>>> in
>>>>>>>>>>>> dc1
>>>>>>>>>>>> have voting rights, and the ability to become a
> leader.
>>>>>> The
>>>>>>>>>>>> machines
>>>>>>>>>>>> in
>>>>>>>>>>>> the pods all have a weight of zero, and are not
>> expected
>>>>>> to
>>>>>>>>>>>> become
>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>> 

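The zero-weight configuration quoted above is meant to leave only the dc1 machines with voting rights. A minimal sketch of a sanity check for that intent follows, in Java. The class and method names are hypothetical and not part of ZooKeeper; the sketch only reads the `server.N` and `weight.N` entries and assumes a server without a `weight.N` entry defaults to weight 1.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper, not part of ZooKeeper: lists servers whose configured weight is non-zero. */
final class ZeroWeightCheck {

    /** Returns the sorted ids of the servers that carry a non-zero weight in the given config text. */
    static List<Long> votingServers(String configText) {
        Properties props = new Properties();
        try {
            // Properties accepts both '=' and ':' as key/value separators,
            // so "server.1=host:2888:3888" yields the key "server.1".
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Long> voters = new ArrayList<>();
        for (String key : props.stringPropertyNames()) {
            if (!key.startsWith("server.")) {
                continue;
            }
            long sid = Long.parseLong(key.substring("server.".length()));
            // Assumption: a server with no weight.N entry counts with weight 1.
            long weight = Long.parseLong(props.getProperty("weight." + sid, "1").trim());
            if (weight != 0) {
                voters.add(sid);
            }
        }
        Collections.sort(voters);
        return voters;
    }

    public static void main(String[] args) {
        // A shortened copy of the quoted config: two dc1 servers and two pod servers.
        String cfg = String.join("\n",
                "server.1=dc1-zook01.dc01.revsci.net:2888:3888",
                "server.2=dc1-zook02.dc01.revsci.net:2888:3888",
                "server.6=pd1-zook01.pd01.revsci.net:2888:3888",
                "server.9=pd4-zook02.iad1.audsci.net:2888:3888",
                "weight.1=1",
                "weight.2=1",
                "weight.6=0",
                "weight.9=0");
        System.out.println("Servers expected to vote or lead: " + votingServers(cfg));
    }
}
```

If a server such as pd4-zook02 (server.9) shows up as leader in the logs but not in this list, the election is not honouring the weights, which matches the symptom reported in the thread.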

Re: Unending Leader Elections in WAN deploy

Posted by Patrick Hunt <ph...@apache.org>.
(I see the same error in fle0weighttest using latest 3.2 btw)

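The quoted message below notes that an assertion raised in a thread other than the main test thread is not tracked by JUnit, so the test is reported as a success. One common way around this is to capture the failure in the worker thread and rethrow it on the calling thread. A minimal sketch follows, using only the JDK; the class and method names are hypothetical and this is not the code of FLEZeroWeightTest.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration, not code from the ZooKeeper test suite. */
final class WorkerFailureCapture {

    /** Runs the task on a new thread and rethrows, on the caller's thread, anything it threw. */
    static void runAndPropagate(Runnable task) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try {
                task.run();
            } catch (Throwable t) {
                // Remember the failure instead of letting it die with the thread.
                failure.set(t);
            }
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        Throwable t = failure.get();
        if (t != null) {
            // Rethrown here, the failure is visible to whatever called this method,
            // for example a test method running on the test runner's thread.
            throw new AssertionError("worker thread failed: " + t.getMessage(), t);
        }
    }

    public static void main(String[] args) {
        try {
            runAndPropagate(() -> {
                throw new AssertionError("Elected zero-weight server");
            });
            System.out.println("no failure seen");
        } catch (AssertionError e) {
            System.out.println("failure reached the calling thread: " + e.getMessage());
        }
    }
}
```

Called from a test method, `runAndPropagate` makes the worker's failure fail the test itself rather than only appearing in standard error.
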
Patrick Hunt wrote:
> Mahadev/Flavio -- looks like 0 weight is still busted, fle0weighttest is 
> actually failing on my machine, however it's reported as success:
> ------------- Standard Error -----------------
> Exception in thread "Thread-108" junit.framework.AssertionFailedError: Elected zero-weight server
>     at junit.framework.Assert.fail(Assert.java:47)
>     at org.apache.zookeeper.test.FLEZeroWeightTest$LEThread.run(FLEZeroWeightTest.java:138)
> ------------- ---------------- ---------------
> 
> This is probably because the test is calling assert in a thread other 
> than the main test thread, which junit will not track or know about.
> 
> One problem I see with these tests (at least the 0weight test I looked 
> at): it doesn't have a client attempt to connect to the various servers 
> as part of declaring success. Really we should only consider a test 
> successful (i.e. assert that) if a client can connect to each server in 
> the cluster and change/see changes. As part of fixing this we really 
> need to do a sanity check by testing the various command lines and 
> checking that a client can connect.
> 
> I'm not even sure FLEnewepochtest/fletest/etc... are passing either. new 
> epoch seems to just thrash...
> 
> Also I tried 3 & 5 server quorums "by hand from the command line" with 0 
> weight and they see similar issues to what Todd is seeing.
> 
> I'm using the latest code in mainline btw.
> 
> Patrick
> 
> Mahadev Konar wrote:
>> Hi todd,
>>   I see a lot of
>> java.net.ConnectException: Connection refused
>>         at sun.nio.ch.Net.connect(Native Method)
>>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>>         at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
>>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
>>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>>
>> Is it possible that there is a firewall in the way? Can all the servers
>> 1-9 connect to all the others using the ports that you specified in
>> zoo.cfg, i.e. 2888/3888?
>>
>>
>> Thanks
>> mahadev
>>
>>
>> On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>
>>> Looks like we're not getting *any* leader elected now.... Logs attached.
>>>
>>>> -----Original Message-----
>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>> Sent: Tuesday, August 04, 2009 4:07 PM
>>>> To: zookeeper-dev@hadoop.apache.org
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>
>>>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>>>> shortly...
>>>>
>>>>> -----Original Message-----
>>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>
>>>>> Todd, Mahadev and I looked at this and it turns out to be a
>>>> regression.
>>>>> Ironically a patch I created for 3.2 branch to add quorum tests
>>>> actually
>>>>> broke the quorum config -- a default value for a config parameter
>>> was
>>>>> lost. I'm going to submit a patch asap to get the default back, but
>>>> for
>>>>> the time being you can set:
>>>>>
>>>>> electionAlg=3
>>>>>
>>>>> in each of your config files.
>>>>>
>>>>> You should see reference to FastLeaderElection in your log files if
>>>> this
>>>>> parameter is set correctly.
>>>>>
>>>>> Sorry for the trouble,
>>>>>
>>>>> Patrick
>>>>>
>>>>> Todd Greenwood wrote:
>>>>>> Mahadev,
>>>>>>
>>>>>> I just heard from IT that this build behaves in exactly the same
>>> way
>>>> as
>>>>>> previous versions, e.g. we get continuous leader elections that
>>>>>> disconnect the followers and then get re-elected, and
>>>> disconnect...etc.
>>>>>> This is from a fresh sync to the 3.2 branch:
>>>>>>
>>>>>> svn co
>>>>>>
>>> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
>>>>>> ./branch-3.2
>>>>>>
>>>>>> CHANGES.TXT show the various fixes included:
>>>>>>
>>>>>>
>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>> /src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>>>> Release 3.2.1
>>>>>>
>>>>>> Backward compatibile changes:
>>>>>>
>>>>>> BUGFIXES:
>>>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
>>>> via
>>>>>> flavio)
>>>>>>
>>>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
>>>> via
>>>>>> mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
>>> mahadev)
>>>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
>>> via
>>>>>> mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>>>   (giri via mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
>>>> mahadev)
>>>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
>>>> immediate
>>>>>>   failure. (chris via mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
>>>> via
>>>>>> phunt)
>>>>>>
>>>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>>>> other)
>>>>>>   embedded clients (ryan rawson via phunt)
>>>>>>
>>>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
>>>> via
>>>>>> mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups
>>> correctly
>>>>>>   (flavio via mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
>>>> empty
>>>>>> cert
>>>>>>   (Chris Darroch via phunt)
>>>>>>
>>>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>>>> leading and
>>>>>>   add vote of follower (flavio via mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
>>>> (flavio
>>>>>> via
>>>>>>   mahadev)
>>>>>>
>>>>>> What can I do to assist you with this issue?
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>
>>>>>>> Hi todd,
>>>>>>>  comments in line
>>>>>>>
>>>>>>>
>>>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>>>> wrote:
>>>>>>>> Mahadev,
>>>>>>>>
>>>>>>>> Some quick questions:
>>>>>>>>
>>>>>>>> 1. Version
>>>>>>>>
>>>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
>>> is
>>>>>> still
>>>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in
>>>>>> calling
>>>>>>>> this release 3.2.1?
>>>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as
>>> we
>>>>>> tag
>>>>>>> the
>>>>>>> release.
>>>>>>>
>>>>>>>> 2. Build targets
>>>>>>>>
>>>>>>>> The package target fails b/c the create-cppunit-configure target
>>>>>> fails
>>>>>>>> due to various problems w/ respect to autoconf. Are these
>>>>>> dependencies
>>>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>>>>
>>>>>>>> create-cppunit-configure:
>>>>>>>>      [exec] Can't exec "libtoolize": No such file or directory
>>> at
>>>>>>>> /usr/bin/autoreconf line 188.
>>>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern
>>>> match
>>>>>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
>>> not
>>>>>> found
>>>>>>>> in library
>>>>>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>>>>>> AM_PATH_CPPUNIT
>>>>>>>>      [exec]       If this token and others are legitimate,
>>> please
>>>>>> use
>>>>>>>> m4_pattern_allow.
>>>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>>>>>> AC_PROG_LIBTOOL
>>>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
>>> status:
>>>> 1
>>>>>>> You need auto tools to run this. Please read the README for
>>>> building c
>>>>>>> client library at src/c/ for the installation requirements.
>>>>>>>> 3. Sync failure:
>>>>>>>>
>>>>>>>> This is still failing.
>>>>>>>>
>>>>>>>> svn: URL
>>>>>>>>
>>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>>>>>> doesn't exist
>>>>>>>>
>>>>>>> Yes this hasn't been fixed yet!
>>>>>>>
>>>>>>> Thanks
>>>>>>> mahadev
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Todd Greenwood
>>>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>
>>>>>>>>> Great news. Thank you Mahadev. I'll report our findings later
>>>>>> today.
>>>>>>>>> -Todd
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>
>>>>>>>>>> Hi Todd,
>>>>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
>>>>>> now.
>>>>>>>>>> Thanks
>>>>>>>>>> mahadev
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
>>> <to...@audiencescience.com>
>>>>>>>> wrote:
>>>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Todd,
>>>>>>>>>>>>   Most of the patches that you mention should be in the
>>> branch
>>>>>>>> 3.2 by
>>>>>>>>>>> tomm
>>>>>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>>>>>> tomm.
>>>>>>>>>>> Would
>>>>>>>>>>>> that
>>>>>>>>>>>> suffice for you?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> mahadev
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
>>>> <to...@audiencescience.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>> Another problem...I've reverted to the latest versions of
>>> the
>>>>>>>>>>> patches
>>>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>>>>>> compilation
>>>>>>>>>>>>> errors:
>>>>>>>>>>>>>
>>>>>>>>>>>>> build-generated:
>>>>>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>
>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>     [javac]
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-
>>>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>>>>>> getQuorumPeers()
>>>>>>>>>>> have
>>>>>>>>>>>>> the same erasure
>>>>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>>>>>     [javac]                         ^
>>>>>>>>>>>>>     [javac]
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-
>>>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>>>>>> getServerState()
>>>>>>>>>>> have
>>>>>>>>>>>>> the same erasure
>>>>>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>>>>>     [javac]                       ^
>>>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>>>>
>>>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>> I notice that you've updated the patches referenced for
>>> the
>>>>>> WAN
>>>>>>>>>>>>>> deployment. There appears to be an order dependency w/
>>>> respect
>>>>>>>> to
>>>>>>>>>>>>> these
>>>>>>>>>>>>>> four patches...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>>
>>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>>>>>> ical.java
>>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>>
>>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>>
>>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>>>>>> .java
>>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>>
>>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>>>>>
>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could you advise as to which patches I need to apply, and
>>> in
>>>>>>>> what
>>>>>>>>>>>>> order?
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>>>>>> Compilation
>>>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
>>> latest
>>>>>>>> patches
>>>>>>>>>>>>>> 473,
>>>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
>>> of
>>>>>> the
>>>>>>>>>>>>> patch.
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>>>>
>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>>> src/p
>>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>>>
>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>>> src/p
>>>>>>>>>>>>>>
>>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>>> FastL
>>>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>>>>
>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>>   [javac]
>>>>>>>> ^
>>>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see a reference to getWeight in both
>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> patch
>>>>>>>>>>>>>> 491:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>>>> :
>>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>>>> 0)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, I don't see a reference to this method in
>>>> patches
>>>>>>>> 473,
>>>>>>>>>>>>>> 479,
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> 481. I also don't see a reference to this method in
>>> the
>>>>>>>>>>> trunk...
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Todd Greenwood
>>> [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This repro's in both branch-3.2, and
>>>>>>>> branch-3.2+patches(473,
>>>>>>>>>>>>>> 479,
>>>>>>>>>>>>>> 481).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>>>>>>> pd4-zook02
>>>>>>>>>>> to
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>>>>>>> supposed
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
>>> again,
>>>>>>>> and
>>>>>>>>>>>>> it
>>>>>>>>>>>>>> loops
>>>>>>>>>>>>>> over and over.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>> Server config
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>>>>>>> different
>>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>>>>>>> machines
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> dc1
>>>>>>>>>>>>>> have voting rights, and the ability to become a
>>> leader.
>>>>>>>> The
>>>>>>>>>>>>>> machines
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the pods all have a weight of zero, and are not
>>>> expected
>>>>>>>> to
>>>>>>>>>>>>>> become
>>>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Todd
>>

Re: Unending Leader Elections in WAN deploy

Posted by Patrick Hunt <ph...@apache.org>.
Mahadev/Flavio -- looks like 0 weight is still busted. FLEZeroWeightTest is 
actually failing on my machine, but it's reported as a success:
------------- Standard Error -----------------
Exception in thread "Thread-108" junit.framework.AssertionFailedError: Elected zero-weight server
	at junit.framework.Assert.fail(Assert.java:47)
	at org.apache.zookeeper.test.FLEZeroWeightTest$LEThread.run(FLEZeroWeightTest.java:138)
------------- ---------------- ---------------

This is probably because the test calls assert in a thread other than 
the main test thread, and JUnit does not track or know about failures 
raised there.
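
A minimal sketch of one way around this, assuming a modern Java runtime (lambdas were not available in 2009): run the checking code on the worker thread, remember whatever it throws, and rethrow it from the thread that called the helper. WorkerFailureCapture and runAndRethrow are hypothetical names used for illustration; they are not part of ZooKeeper or JUnit, and this is not necessarily how the test was eventually fixed.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper (not ZooKeeper or JUnit API): makes a failure raised on a
// worker thread visible to the thread that the test framework is watching.
final class WorkerFailureCapture {

    // Runs body on a new thread, waits for it, and rethrows its failure here.
    static void runAndRethrow(Runnable body) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try {
                body.run();
            } catch (Throwable t) {
                // Without this, the error only reaches stderr and the test "passes".
                failure.set(t);
            }
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        Throwable t = failure.get();
        if (t instanceof Error) {
            throw (Error) t;            // includes AssertionError
        }
        if (t instanceof RuntimeException) {
            throw (RuntimeException) t;
        }
        if (t != null) {
            throw new AssertionError(t);
        }
    }
}
```

Called from a test method, an assertion such as "Elected zero-weight server" raised inside the worker would then fail the test instead of only appearing in the standard error output.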

One problem I see with these tests (the 0 weight test is the one I looked 
at): it doesn't have a client attempt to connect to the various servers 
as part of declaring success. Really we should only consider a test 
successful (i.e. assert that) if a client can connect to each server in 
the cluster and both make changes and see changes. As part of fixing 
this we really need to do a sanity check by testing the various command 
lines and checking that a client can connect.
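
As a rough illustration of the weakest useful form of such a check, the sketch below sends ZooKeeper's four-letter "ruok" command to a server's client port and expects "imok" back. RuokProbe is a hypothetical name, and the host, port and timeout are whatever the caller supplies (the thread does not state the client port). Note that "imok" only shows the process is up and bound to its client port; it does not prove the server has joined a quorum, so it is not a substitute for the connect-and-change check described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: liveness probe using the "ruok" four-letter command.
final class RuokProbe {

    // Returns true if host:port answers "imok" to "ruok" within timeoutMs.
    static boolean isOk(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);

            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Read up to four bytes; the server may close right after replying.
            InputStream in = socket.getInputStream();
            byte[] reply = new byte[4];
            int read = 0;
            while (read < reply.length) {
                int n = in.read(reply, read, reply.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            return read == 4
                    && "imok".equals(new String(reply, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            // Refused, timed out, or unresolved host: treat as not OK.
            return false;
        }
    }
}
```

A test could call RuokProbe.isOk for every server it started before going on to the stronger check of writing through one server and reading the change back through the others.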

I'm not even sure FLENewEpochTest, FLETest, etc. are passing either. The 
new epoch test seems to just thrash...

Also I tried 3 & 5 server quorums "by hand from the command line" with 0 
weight and they see similar issues to what Todd is seeing.
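
For reference, here is a simplified sketch of what weights and groups are supposed to mean, based on my reading of the hierarchical quorum documentation: a quorum needs a majority of the weight inside a majority of the groups that have non-zero weight, and a zero-weight server should never lead. HierarchicalQuorumSketch is a hypothetical class written for this note; it is not the QuorumHierarchical code from the branch, and threadConfig() simply encodes the group/weight lines Todd posted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a weighted hierarchical quorum check.
final class HierarchicalQuorumSketch {
    private final Map<Long, Long> serverWeight = new HashMap<>();
    private final Map<Long, Long> serverGroup = new HashMap<>();

    void addServer(long sid, long group, long weight) {
        serverGroup.put(sid, group);
        serverWeight.put(sid, weight);
    }

    // Servers 1-5 in group 1 with weight 1; servers 6-9 in group 2 with weight 0.
    static HierarchicalQuorumSketch threadConfig() {
        HierarchicalQuorumSketch q = new HierarchicalQuorumSketch();
        for (long sid = 1; sid <= 5; sid++) {
            q.addServer(sid, 1, 1);
        }
        for (long sid = 6; sid <= 9; sid++) {
            q.addServer(sid, 2, 0);
        }
        return q;
    }

    // True if acks hold a weight majority in a majority of non-zero-weight groups.
    boolean containsQuorum(Set<Long> acks) {
        Map<Long, Long> groupTotal = new HashMap<>();
        Map<Long, Long> groupAcked = new HashMap<>();
        for (Map.Entry<Long, Long> e : serverGroup.entrySet()) {
            long sid = e.getKey();
            long group = e.getValue();
            long weight = serverWeight.get(sid);
            groupTotal.merge(group, weight, Long::sum);
            if (acks.contains(sid)) {
                groupAcked.merge(group, weight, Long::sum);
            }
        }
        int votingGroups = 0;
        int satisfiedGroups = 0;
        for (Map.Entry<Long, Long> e : groupTotal.entrySet()) {
            long total = e.getValue();
            if (total == 0) {
                continue; // a group made only of zero-weight servers has no vote
            }
            votingGroups++;
            long acked = groupAcked.getOrDefault(e.getKey(), 0L);
            if (2 * acked > total) {
                satisfiedGroups++;
            }
        }
        return votingGroups > 0 && 2 * satisfiedGroups > votingGroups;
    }

    // Mirrors the getWeight(sid) != 0 check quoted from patch 491 in this thread.
    boolean canLead(long sid) {
        return serverWeight.getOrDefault(sid, 0L) != 0;
    }
}
```

Under this model, with Todd's config any three of servers 1-5 form a quorum, servers 6-9 on their own never do, and canLead(9) is false, which is why pd4-zook02 being elected points to a bug rather than a configuration mistake.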

I'm using the latest code in mainline btw.

Patrick

Mahadev Konar wrote:
> Hi todd,
>   I see a lot of 
> 
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.Net.connect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>         at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
>         at java.lang.Thread.run(Thread.java:619)
> 
> 
> Is it possible that there is some firewall? Can all the servers 1-9 connect
> to all the others using ports that you specified in zoo.cfg i.e 2888/3888?
> 
> 
> Thanks
> mahadev
> 
> 
> On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
>> Looks like we're not getting *any* leader elected now.... Logs attached.
>>
>>> -----Original Message-----
>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>> Sent: Tuesday, August 04, 2009 4:07 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>
>>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>>> shortly...
>>>
>>>> -----Original Message-----
>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>>> To: zookeeper-dev@hadoop.apache.org
>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>
>>>> Todd, Mahadev and I looked at this and it turns out to be a
>>> regression.
>>>> Ironically a patch I created for 3.2 branch to add quorum tests
>>> actually
>>>> broke the quorum config -- a default value for a config parameter
>> was
>>>> lost. I'm going to submit a patch asap to get the default back, but
>>> for
>>>> the time being you can set:
>>>>
>>>> electionAlg=3
>>>>
>>>> in each of your config files.
>>>>
>>>> You should see reference to FastLeaderElection in your log files if
>>> this
>>>> parameter is set correctly.
>>>>
>>>> Sorry for the trouble,
>>>>
>>>> Patrick
>>>>
>>>> Todd Greenwood wrote:
>>>>> Mahadev,
>>>>>
>>>>> I just heard from IT that this build behaves in exactly the same
>> way
>>> as
>>>>> previous versions, e.g. we get continuous leader elections that
>>>>> disconnect the followers and then get re-elected, and
>>> disconnect...etc.
>>>>> This is from a fresh sync to the 3.2 branch:
>>>>>
>>>>> svn co
>>>>>
>> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
>>>>> ./branch-3.2
>>>>>
>>>>> CHANGES.TXT show the various fixes included:
>>>>>
>>>>>
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>> /src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>>> Release 3.2.1
>>>>>
>>>>> Backward compatibile changes:
>>>>>
>>>>> BUGFIXES:
>>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
>>> via
>>>>> flavio)
>>>>>
>>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
>>> via
>>>>> mahadev)
>>>>>
>>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
>> mahadev)
>>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
>> via
>>>>> mahadev)
>>>>>
>>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>>   (giri via mahadev)
>>>>>
>>>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
>>> mahadev)
>>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
>>> immediate
>>>>>   failure. (chris via mahadev)
>>>>>
>>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
>>> via
>>>>> phunt)
>>>>>
>>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>>> other)
>>>>>   embedded clients (ryan rawson via phunt)
>>>>>
>>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
>>> via
>>>>> mahadev)
>>>>>
>>>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups
>> correctly
>>>>>   (flavio via mahadev)
>>>>>
>>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
>>> empty
>>>>> cert
>>>>>   (Chris Darroch via phunt)
>>>>>
>>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>>> leading and
>>>>>   add vote of follower (flavio via mahadev)
>>>>>
>>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
>>> (flavio
>>>>> via
>>>>>   mahadev)
>>>>>
>>>>> What can I do to assist you with this issue?
>>>>>
>>>>> -Todd
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>
>>>>>> Hi todd,
>>>>>>  comments in line
>>>>>>
>>>>>>
>>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>>> wrote:
>>>>>>> Mahadev,
>>>>>>>
>>>>>>> Some quick questions:
>>>>>>>
>>>>>>> 1. Version
>>>>>>>
>>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
>> is
>>>>> still
>>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in
>>>>> calling
>>>>>>> this release 3.2.1?
>>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as
>> we
>>>>> tag
>>>>>> the
>>>>>> release.
>>>>>>
>>>>>>> 2. Build targets
>>>>>>>
>>>>>>> The package target fails b/c the create-cppunit-configure target
>>>>> fails
>>>>>>> due to various problems w/ respect to autoconf. Are these
>>>>> dependencies
>>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>>>
>>>>>>> create-cppunit-configure:
>>>>>>>      [exec] Can't exec "libtoolize": No such file or directory
>> at
>>>>>>> /usr/bin/autoreconf line 188.
>>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern
>>> match
>>>>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
>> not
>>>>> found
>>>>>>> in library
>>>>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>>>>> AM_PATH_CPPUNIT
>>>>>>>      [exec]       If this token and others are legitimate,
>> please
>>>>> use
>>>>>>> m4_pattern_allow.
>>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>>>>> AC_PROG_LIBTOOL
>>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
>> status:
>>> 1
>>>>>> You need auto tools to run this. Please read the README for
>>> building c
>>>>>> client library at src/c/ for the installation requirements.
>>>>>>> 3. Sync failure:
>>>>>>>
>>>>>>> This is still failing.
>>>>>>>
>>>>>>> svn: URL
>>>>>>>
>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>>>>> doesn't exist
>>>>>>>
>>>>>> Yes this hasn't been fixed yet!
>>>>>>
>>>>>> Thanks
>>>>>> mahadev
>>>>>>> -Todd
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Todd Greenwood
>>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>
>>>>>>>> Great news. Thank you Mahadev. I'll report our findings later
>>>>> today.
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>
>>>>>>>>> Hi Todd,
>>>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
>>>>> now.
>>>>>>>>> Thanks
>>>>>>>>> mahadev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
>> <to...@audiencescience.com>
>>>>>>> wrote:
>>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>
>>>>>>>>>>> Hi Todd,
>>>>>>>>>>>   Most of the patches that you mention should be in the
>> branch
>>>>>>> 3.2 by
>>>>>>>>>> tomm
>>>>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>>>>> tomm.
>>>>>>>>>> Would
>>>>>>>>>>> that
>>>>>>>>>>> suffice for you?
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> mahadev
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
>>> <to...@audiencescience.com>
>>>>>>>> wrote:
>>>>>>>>>>>> Another problem...I've reverted to the latest versions of
>> the
>>>>>>>>>> patches
>>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>>>>> compilation
>>>>>>>>>>>> errors:
>>>>>>>>>>>>
>>>>>>>>>>>> build-generated:
>>>>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>     [javac]
>>>>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>> atched/branch-
>>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>>>>> getQuorumPeers()
>>>>>>>>>> have
>>>>>>>>>>>> the same erasure
>>>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>>>>     [javac]                         ^
>>>>>>>>>>>>     [javac]
>>>>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>> atched/branch-
>>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>>>>> getServerState()
>>>>>>>>>> have
>>>>>>>>>>>> the same erasure
>>>>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>>>>     [javac]                       ^
>>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>>>
>>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>>>
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>> I notice that you've updated the patches referenced for
>> the
>>>>> WAN
>>>>>>>>>>>>> deployment. There appears to be an order dependency w/
>>> respect
>>>>>>> to
>>>>>>>>>>>> these
>>>>>>>>>>>>> four patches...
>>>>>>>>>>>>>
>>>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>>>>> ical.java
>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>>>>> .java
>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>
>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>>>>
>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you advise as to which patches I need to apply, and
>> in
>>>>>>> what
>>>>>>>>>>>> order?
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>>>>> Compilation
>>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
>> latest
>>>>>>> patches
>>>>>>>>>>>>> 473,
>>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
>> of
>>>>> the
>>>>>>>>>>>> patch.
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>> src/p
>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>> src/p
>>>>>>>>>>>>>
>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>> FastL
>>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>>>
>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>   [javac]
>>>>>>> ^
>>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see a reference to getWeight in both
>>>>>>> FastLeaderElection.java
>>>>>>>>>>>> in
>>>>>>>>>>>>> patch
>>>>>>>>>>>>> 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>>> :
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>>> 0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I don't see a reference to this method in
>>> patches
>>>>>>> 473,
>>>>>>>>>>>>> 479,
>>>>>>>>>>>>> or
>>>>>>>>>>>>> 481. I also don't see a reference to this method in
>> the
>>>>>>>>>> trunk...
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Todd Greenwood
>> [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> This repro's in both branch-3.2, and
>>>>>>> branch-3.2+patches(473,
>>>>>>>>>>>>> 479,
>>>>>>>>>>>>> 481).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>>>>>> pd4-zook02
>>>>>>>>>> to
>>>>>>>>>>>>> be
>>>>>>>>>>>>> the
>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>>>>>> supposed
>>>>>>>>>>>>> to
>>>>>>>>>>>>> be
>>>>>>>>>>>>> and
>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
>> again,
>>>>>>> and
>>>>>>>>>>>> it
>>>>>>>>>>>>> loops
>>>>>>>>>>>>> over and over.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>> Server config
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>
>>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>>
>>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>>>>>> different
>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>>>>>> machines
>>>>>>>>>>>>> in
>>>>>>>>>>>>> dc1
>>>>>>>>>>>>> have voting rights, and the ability to become a
>> leader.
>>>>>>> The
>>>>>>>>>>>>> machines
>>>>>>>>>>>>> in
>>>>>>>>>>>>> the pods all have a weight of zero, and are not
>>> expected
>>>>>>> to
>>>>>>>>>>>>> become
>>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
> 

Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
  I see a lot of these:

java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
        at java.lang.Thread.run(Thread.java:619)


Is it possible that there is a firewall in the way? Can all the servers 1-9
connect to all the others using the ports that you specified in zoo.cfg,
i.e. 2888/3888?
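
One quick way to answer that from each machine is a plain TCP connect test against every peer. The sketch below is only an illustration: QuorumPortCheck is a hypothetical name, and the host list, ports and timeout are supplied by the caller. As far as I understand, the election port (3888 here) is the one every server listens on, while the first port (2888) is only opened by the current leader, so a refusal on 2888 from a non-leader may be expected rather than a sign of a firewall.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which host:port pairs cannot be reached from here.
final class QuorumPortCheck {

    // True if a TCP connection to host:port succeeds within timeoutMs.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, filtered (timeout), or unresolved host name.
            return false;
        }
    }

    // Returns "host:port" for every combination that could not be reached.
    static List<String> unreachable(List<String> hosts, int[] ports, int timeoutMs) {
        List<String> failed = new ArrayList<>();
        for (String host : hosts) {
            for (int port : ports) {
                if (!canConnect(host, port, timeoutMs)) {
                    failed.add(host + ":" + port);
                }
            }
        }
        return failed;
    }
}
```

Running unreachable(...) on each of the nine servers with the host names from the server.N lines and ports {2888, 3888} gives the full connectivity matrix; a connect that times out rather than being refused is the more typical signature of a firewall dropping packets.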


Thanks
mahadev


On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Looks like we're not getting *any* leader elected now.... Logs attached.
> 
>> -----Original Message-----
>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>> Sent: Tuesday, August 04, 2009 4:07 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>> 
>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>> shortly...
>> 
>>> -----Original Message-----
>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>> 
>>> Todd, Mahadev and I looked at this and it turns out to be a
>> regression.
>>> Ironically a patch I created for 3.2 branch to add quorum tests
>> actually
>>> broke the quorum config -- a default value for a config parameter
> was
>>> lost. I'm going to submit a patch asap to get the default back, but
>> for
>>> the time being you can set:
>>> 
>>> electionAlg=3
>>> 
>>> in each of your config files.
>>> 
>>> You should see reference to FastLeaderElection in your log files if
>> this
>>> parameter is set correctly.
>>> 
>>> Sorry for the trouble,
>>> 
>>> Patrick
>>> 
>>> Todd Greenwood wrote:
>>>> Mahadev,
>>>> 
>>>> I just heard from IT that this build behaves in exactly the same
> way
>> as
>>>> previous versions, e.g. we get continuous leader elections that
>>>> disconnect the followers and then get re-elected, and
>> disconnect...etc.
>>>> 
>>>> This is from a fresh sync to the 3.2 branch:
>>>> 
>>>> svn co
>>>> 
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
>>>> ./branch-3.2
>>>> 
>>>> CHANGES.TXT show the various fixes included:
>>>> 
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>> /src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>> Release 3.2.1
>>>> 
>>>> Backward compatibile changes:
>>>> 
>>>> BUGFIXES:
>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
>> via
>>>> flavio)
>>>> 
>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
>> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
> mahadev)
>>>> 
>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>   (giri via mahadev)
>>>> 
>>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
>> mahadev)
>>>> 
>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
>> immediate
>>>>   failure. (chris via mahadev)
>>>> 
>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
>> via
>>>> phunt)
>>>> 
>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>> other)
>>>>   embedded clients (ryan rawson via phunt)
>>>> 
>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
>> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups
> correctly
>>>>   (flavio via mahadev)
>>>> 
>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
>> empty
>>>> cert
>>>>   (Chris Darroch via phunt)
>>>> 
>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>> leading and
>>>>   add vote of follower (flavio via mahadev)
>>>> 
>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
>> (flavio
>>>> via
>>>>   mahadev)
>>>> 
>>>> What can I do to assist you with this issue?
>>>> 
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Hi todd,
>>>>>  comments in line
>>>>> 
>>>>> 
>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>> Mahadev,
>>>>>> 
>>>>>> Some quick questions:
>>>>>> 
>>>>>> 1. Version
>>>>>> 
>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
> is
>>>> still
>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in
>>>> calling
>>>>>> this release 3.2.1?
>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as
> we
>>>> tag
>>>>> the
>>>>> release.
>>>>> 
>>>>>> 2. Build targets
>>>>>> 
>>>>>> The package target fails b/c the create-cppunit-configure target
>>>> fails
>>>>>> due to various problems w/ respect to autoconf. Are these
>>>> dependencies
>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>> 
>>>>>> create-cppunit-configure:
>>>>>>      [exec] Can't exec "libtoolize": No such file or directory
> at
>>>>>> /usr/bin/autoreconf line 188.
>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern
>> match
>>>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
> not
>>>> found
>>>>>> in library
>>>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>>>> AM_PATH_CPPUNIT
>>>>>>      [exec]       If this token and others are legitimate,
> please
>>>> use
>>>>>> m4_pattern_allow.
>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>>>> AC_PROG_LIBTOOL
>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
> status:
>> 1
>>>>>> 
>>>>> You need auto tools to run this. Please read the README for
>> building c
>>>>> client library at src/c/ for the installation requirements.
>>>>>> 3. Sync failure:
>>>>>> 
>>>>>> This is still failing.
>>>>>> 
>>>>>> svn: URL
>>>>>> 
> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>>>> doesn't exist
>>>>>> 
>>>>> Yes this hasn't been fixed yet!
>>>>> 
>>>>> Thanks
>>>>> mahadev
>>>>>> -Todd
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Todd Greenwood
>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Great news. Thank you Mahadev. I'll report our findings later
>>>> today.
>>>>>>> -Todd
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>> 
>>>>>>>> Hi Todd,
>>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
>>>> now.
>>>>>>>> Thanks
>>>>>>>> mahadev
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
> <to...@audiencescience.com>
>>>>>> wrote:
>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> Hi Todd,
>>>>>>>>>>   Most of the patches that you mention should be in the
> branch
>>>>>> 3.2 by
>>>>>>>>> tomm
>>>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>>>> tomm.
>>>>>>>>> Would
>>>>>>>>>> that
>>>>>>>>>> suffice for you?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> mahadev
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
>> <to...@audiencescience.com>
>>>>>>> wrote:
>>>>>>>>>>> Another problem...I've reverted to the latest versions of
> the
>>>>>>>>> patches
>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>>>> compilation
>>>>>>>>>>> errors:
>>>>>>>>>>> 
>>>>>>>>>>> build-generated:
>>>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>> 
>>>>>>>>>>> compile-main:
>>>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>     [javac]
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-
>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>>>> getQuorumPeers()
>>>>>>>>> have
>>>>>>>>>>> the same erasure
>>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>>>     [javac]                         ^
>>>>>>>>>>>     [javac]
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-
>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>>>> getServerState()
>>>>>>>>> have
>>>>>>>>>>> the same erasure
>>>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>>>     [javac]                       ^
>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>> 
>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>> 
>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> I notice that you've updated the patches referenced for
> the
>>>> WAN
>>>>>>>>>>>> deployment. There appears to be an order dependency w/
>> respect
>>>>>> to
>>>>>>>>>>> these
>>>>>>>>>>>> four patches...
>>>>>>>>>>>> 
>>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>>> 
>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>>>> ical.java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>>>> .java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>>>> 
>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>> 
>>>>>>>>>>>> Could you advise as to which patches I need to apply, and
> in
>>>>>> what
>>>>>>>>>>> order?
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>>>> Compilation
>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
> latest
>>>>>> patches
>>>>>>>>>>>> 473,
>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
> of
>>>> the
>>>>>>>>>>> patch.
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>> 
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> 
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> 
>>>>>> 
>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastL
>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>> 
>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>   [javac]
>>>>>> ^
>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>> 
>>>>>>>>>>>> I see a reference to getWeight in both
>>>>>> FastLeaderElection.java
>>>>>>>>>>> in
>>>>>>>>>>>> patch
>>>>>>>>>>>> 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>> :
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>> 0)
>>>>>>>>>>>> 
>>>>>>>>>>>> However, I don't see a reference to this method in
>> patches
>>>>>> 473,
>>>>>>>>>>>> 479,
>>>>>>>>>>>> or
>>>>>>>>>>>> 481. I also don't see a reference to this method in
> the
>>>>>>>>> trunk...
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood
> [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This repro's in both branch-3.2, and
>>>>>> branch-3.2+patches(473,
>>>>>>>>>>>> 479,
>>>>>>>>>>>> 481).
>>>>>>>>>>>> 
>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>>>>> pd4-zook02
>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> the
>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>>>>> supposed
>>>>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> and
>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
> again,
>>>>>> and
>>>>>>>>>>> it
>>>>>>>>>>>> loops
>>>>>>>>>>>> over and over.
>>>>>>>>>>>> 
>>>>>>>>>>>> -------------
>>>>>>>>>>>> Server config
>>>>>>>>>>>> -------------
>>>>>>>>>>>> 
>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>> 
>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>> 
>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>>>>> different
>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>>>>> machines
>>>>>>>>>>>> in
>>>>>>>>>>>> dc1
>>>>>>>>>>>> have voting rights, and the ability to become a
> leader.
>>>>>> The
>>>>>>>>>>>> machines
>>>>>>>>>>>> in
>>>>>>>>>>>> the pods all have a weight of zero, and are not
>> expected
>>>>>> to
>>>>>>>>>>>> become
>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>> 
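
The server config quoted above gives every dc1 machine a weight of 1 and every pod machine a weight of 0. Below is a minimal sketch of how a group/weight quorum check can behave under that config. It is an illustration only, not ZooKeeper's QuorumHierarchical code: the function name `contains_quorum` is hypothetical, and the rule that groups with zero total weight are ignored is an assumption of the sketch.

```python
# Simplified sketch of a hierarchical (group/weight) quorum check.
# Assumption: illustrates the idea behind the quoted config; it is not
# ZooKeeper's actual implementation.

def contains_quorum(acks, groups, weights):
    """Return True if the servers in `acks` form a quorum.

    groups:  {group_id: [server_id, ...]}
    weights: {server_id: weight}
    """
    # Total weight per group.
    group_weight = {
        gid: sum(weights[sid] for sid in members)
        for gid, members in groups.items()
    }
    # Assumption: groups whose total weight is zero do not count at all.
    voting_groups = [gid for gid, w in group_weight.items() if w > 0]

    # A group is "won" when the acked servers hold more than half its weight.
    won = 0
    for gid in voting_groups:
        acked = sum(weights[sid] for sid in groups[gid] if sid in acks)
        if acked > group_weight[gid] / 2:
            won += 1

    # A quorum needs a majority of the voting groups.
    return won > len(voting_groups) / 2


# Groups and weights taken from the quoted config.
groups = {1: [1, 2, 3, 4, 5], 2: [6, 7, 8, 9]}
weights = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}

print(contains_quorum({1, 2, 3}, groups, weights))           # three dc1 servers
print(contains_quorum({1, 2, 6, 7, 8, 9}, groups, weights))  # two dc1 + all pods
```

Under these assumptions only the dc1 servers can contribute to a quorum, which matches the stated intent that the pod machines neither vote nor lead.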


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Looks like we're not getting *any* leader elected now.... Logs attached.

> -----Original Message-----
> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> Sent: Tuesday, August 04, 2009 4:07 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: RE: Unending Leader Elections in WAN deploy
> 
> Patrick, thanks! I'll forward on to IT and I'll report back to you
> shortly...
> 
> > -----Original Message-----
> > From: Patrick Hunt [mailto:phunt@apache.org]
> > Sent: Tuesday, August 04, 2009 3:55 PM
> > To: zookeeper-dev@hadoop.apache.org
> > Subject: Re: Unending Leader Elections in WAN deploy
> >
> > Todd, Mahadev and I looked at this and it turns out to be a regression.
> > Ironically a patch I created for 3.2 branch to add quorum tests actually
> > broke the quorum config -- a default value for a config parameter was
> > lost. I'm going to submit a patch asap to get the default back, but for
> > the time being you can set:
> >
> > electionAlg=3
> >
> > in each of your config files.
> >
> > You should see reference to FastLeaderElection in your log files if this
> > parameter is set correctly.
> >
> > Sorry for the trouble,
> >
> > Patrick
> >
> > Todd Greenwood wrote:
> > > Mahadev,
> > >
> > > I just heard from IT that this build behaves in exactly the same way as
> > > previous versions, e.g. we get continuous leader elections that
> > > disconnect the followers and then get re-elected, and disconnect...etc.
> > >
> > > This is from a fresh sync to the 3.2 branch:
> > >
> > > svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
> > >
> > > CHANGES.txt shows the various fixes included:
> > >
> > > toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> > > Release 3.2.1
> > >
> > > Backward compatibile changes:
> > >
> > > BUGFIXES:
> > >   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
> > >
> > >   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
> > >
> > >   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> > >
> > >   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
> > >
> > >   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
> > >   (giri via mahadev)
> > >
> > >   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> > >
> > >   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
> > >   failure. (chris via mahadev)
> > >
> > >   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
> > >
> > >   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
> > >   embedded clients (ryan rawson via phunt)
> > >
> > >   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
> > >
> > >   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
> > >   (flavio via mahadev)
> > >
> > >   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
> > >   (Chris Darroch via phunt)
> > >
> > >   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
> > >   add vote of follower (flavio via mahadev)
> > >
> > >   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
> > >   mahadev)
> > >
> > > What can I do to assist you with this issue?
> > >
> > > -Todd
> > >
> > >> -----Original Message-----
> > >> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> > >> Sent: Tuesday, August 04, 2009 12:43 PM
> > >> To: zookeeper-dev@hadoop.apache.org
> > >> Subject: Re: Unending Leader Elections in WAN deploy
> > >>
> > >> Hi todd,
> > >>  comments in line
> > >>
> > >>
> > >> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
> > > wrote:
> > >>> Mahadev,
> > >>>
> > >>> Some quick questions:
> > >>>
> > >>> 1. Version
> > >>>
> > >>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
is
> > > still
> > >>> calling this 3.2.0. Should this be rev'd, and am I correct in
> > > calling
> > >>> this release 3.2.1?
> > >> Yes the release is 3.2.1. The build.xml will be fixed as soon as
we
> > > tag
> > >> the
> > >> release.
> > >>
> > >>> 2. Build targets
> > >>>
> > >>> The package target fails b/c the create-cppunit-configure target
> > > fails
> > >>> due to various problems w/ respect to autoconf. Are these
> > > dependencies
> > >>> documented somewhere ? I'd like to have a fully building system.
> > >>>
> > >>> create-cppunit-configure:
> > >>>      [exec] Can't exec "libtoolize": No such file or directory
at
> > >>> /usr/bin/autoreconf line 188.
> > >>>      [exec] Use of uninitialized value $libtoolize in pattern
> match
> > >>> (m//) at /usr/bin/autoreconf line 188.
> > >>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
not
> > > found
> > >>> in library
> > >>>      [exec] configure.ac:33: error: possibly undefined macro:
> > >>> AM_PATH_CPPUNIT
> > >>>      [exec]       If this token and others are legitimate,
please
> > > use
> > >>> m4_pattern_allow.
> > >>>      [exec]       See the Autoconf documentation.
> > >>>      [exec] configure.ac:53: error: possibly undefined macro:
> > >>> AC_PROG_LIBTOOL
> > >>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
status:
> 1
> > >>>
> > >> You need auto tools to run this. Please read the README for
> building c
> > >> client library at src/c/ for the installation requirements.
> > >>> 3. Sync failure:
> > >>>
> > >>> This is still failing.
> > >>>
> > >>> svn: URL
> > >>>
'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> > >>> doesn't exist
> > >>>
> > >> Yes this hasn't been fixed yet!
> > >>
> > >> Thanks
> > >> mahadev
> > >>> -Todd
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Todd Greenwood
> > >>>> Sent: Tuesday, August 04, 2009 11:26 AM
> > >>>> To: 'zookeeper-user@hadoop.apache.org'
> > >>>> Subject: RE: Unending Leader Elections in WAN deploy
> > >>>>
> > >>>> Great news. Thank you Mahadev. I'll report our findings later
> > > today.
> > >>>> -Todd
> > >>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> > >>>>> Sent: Tuesday, August 04, 2009 11:20 AM
> > >>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>
> > >>>>> Hi Todd,
> > >>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
> > > now.
> > >>>>> Thanks
> > >>>>> mahadev
> > >>>>>
> > >>>>>
> > >>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
<to...@audiencescience.com>
> > >>> wrote:
> > >>>>>> That'd be perfect. Thanks!
> > >>>>>>
> > >>>>>>> -----Original Message-----
> > >>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> > >>>>>>> Sent: Monday, August 03, 2009 4:24 PM
> > >>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>>>
> > >>>>>>> Hi Todd,
> > >>>>>>>   Most of the patches that you mention should be in the
branch
> > >>> 3.2 by
> > >>>>>> tomm
> > >>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
> > > tomm.
> > >>>>>> Would
> > >>>>>>> that
> > >>>>>>> suffice for you?
> > >>>>>>>
> > >>>>>>> Thanks
> > >>>>>>> mahadev
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
> <to...@audiencescience.com>
> > >>>> wrote:
> > >>>>>>>> Another problem...I've reverted to the latest versions of
the
> > >>>>>> patches
> > >>>>>>>> that are not specific to branch-3.2, and I'm getting two
> > >>> compilation
> > >>>>>>>> errors:
> > >>>>>>>>
> > >>>>>>>> build-generated:
> > >>>>>>>>     [javac] Compiling 44 source files to
> > >>>>>>>>
> > >
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > >>>>>>>> atched/branch-3.2/build/classes
> > >>>>>>>>
> > >>>>>>>> compile-main:
> > >>>>>>>>     [javac] Compiling 2 source files to
> > >>>>>>>>
> > >
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > >>>>>>>> atched/branch-3.2/build/classes
> > >>>>>>>>     [javac]
> > >>>>>>>>
> > >
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > >>>>>> atched/branch-
> > >>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> > >>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
> > >>> getQuorumPeers()
> > >>>>>> have
> > >>>>>>>> the same erasure
> > >>>>>>>>     [javac]         public String[] getQuorumPeers();
> > >>>>>>>>     [javac]                         ^
> > >>>>>>>>     [javac]
> > >>>>>>>>
> > >
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > >>>>>> atched/branch-
> > >>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> > >>>>>>>> mStats.java:31: name clash: getServerState() and
> > >>> getServerState()
> > >>>>>> have
> > >>>>>>>> the same erasure
> > >>>>>>>>     [javac]         public String getServerState();
> > >>>>>>>>     [javac]                       ^
> > >>>>>>>>     [javac] 2 errors
> > >>>>>>>>
> > >>>>>>>> My build process is pretty simple:
> > >>>>>>>>
> > >>>>>>>> 1. copy the branch-3.2 source to a temp directory
> > >>>>>>>> (src/patched/branch-3.2)
> > >>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> > >>>>>>>> 3. build zookeeper in the temp directory
> > >>>>>>>>
> > >>>>>>>> -Todd
> > >>>>>>>>> -----Original Message-----
> > >>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> > >>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> > >>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> > >>>>>>>>>
> > >>>>>>>>> Flavio,
> > >>>>>>>>> I notice that you've updated the patches referenced for
the
> > > WAN
> > >>>>>>>>> deployment. There appears to be an order dependency w/
> respect
> > >>> to
> > >>>>>>>> these
> > >>>>>>>>> four patches...
> > >>>>>>>>>
> > >>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> > >>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> > >>>>>>>>>
> > >>>>>>>>> 473 -> 479 (479 fails)
> > >>>>>>>>>
> > >>>>>>>>>
> > >
>
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> > >>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
> > >>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
> > >>>>>>>>> patching file
> > >>>>>>>>>
> > >
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
> > >>>>>>>>> ical.java
> > >>>>>>>>> patching file
> > >>>>>>>>>
> > >
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> > >>>>>>>>> patching file
> > >>>>>>>>>
> > >
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
> > >>>>>>>>> .java
> > >>>>>>>>> patching file
> > >>>>>>>>>
> > >>>
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> > >>>>>>>>> Hunk #1 FAILED at 93.
> > >>>>>>>>> Hunk #2 FAILED at 145.
> > >>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
> > >>>>>>>>>
> > >
>
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> > >
>
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> > >>>>>>>>> /src/patched/branch-3.2$ h ../patches/
> > >>>>>>>>>
> > >>>>>>>>> Could you advise as to which patches I need to apply, and
in
> > >>> what
> > >>>>>>>> order?
> > >>>>>>>>> -Todd
> > >>>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> > >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>>>>>>
> > >>>>>>>>>> Perfect! Thanks for the update, Todd.
> > >>>>>>>>>>
> > >>>>>>>>>> -Flavio
> > >>>>>>>>>>
> > >>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
> > >>> Compilation
> > >>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
latest
> > >>> patches
> > >>>>>>>>> 473,
> > >>>>>>>>>>> 479, 481, and 491.
> > >>>>>>>>>>>
> > >>>>>>>>>>> -Todd
> > >>>>>>>>>>>
> > >>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> > >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
of
> > > the
> > >>>>>>>> patch.
> > >>>>>>>>>>>> -Flavio
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Flavio,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'm getting a compilation error for patch 491:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> compile-main:
> > >>>>>>>>>>>>>   [javac] Compiling 1 source file to
> > >>>>>>>>>>>>>
> > >>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > >>>>>>>>>>>>> src/p
> > >>>>>>>>>>>>> atched/branch-3.2/build/classes
> > >>>>>>>>>>>>>   [javac]
> > >>>>>>>>>>>>>
> > >>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > >>>>>>>>>>>>> src/p
> > >>>>>>>>>>>>>
> > >>>
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> > >>>>>>>>>>>>> FastL
> > >>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
> > >>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> > >>>>>>>>>>>>>   [javac] location: interface
> > >>>>>>>>>>>>>
> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> > >>>>>>>>>>>>>   [javac]
> > >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>>>>>>>>>>>>   [javac]
> > >>> ^
> > >>>>>>>>>>>>>   [javac] 1 error
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I see a reference to getWeight in both
> > >>> FastLeaderElection.java
> > >>>>>>>> in
> > >>>>>>>>>>>>> patch
> > >>>>>>>>>>>>> 491:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
> > >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
> > >>>>>>>>>>>>> FastLeaderElection.java
> > >>>>>>>>>>>>> :
> > >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> > >>>>>>>>>>>>> 0)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> However, I don't see a reference to this method in
> patches
> > >>> 473,
> > >>>>>>>>> 479,
> > >>>>>>>>>>>>> or
> > >>>>>>>>>>>>> 481. I also don't see a reference to this method in
the
> > >>>>>> trunk...
> > >>>>>>>>>>>>> -Todd
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>> From: Todd Greenwood
[mailto:toddg@audiencescience.com]
> > >>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> > >>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
> > >>>>>>>>>>>>>> -Todd
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> > >>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> You're missing 491 from your set of patches.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -Flavio
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> This repro's in both branch-3.2, and
> > >>> branch-3.2+patches(473,
> > >>>>>>>>> 479,
> > >>>>>>>>>>>>>> 481).
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Basically, it seems like the nodes are electing
> > >>> pd4-zook02
> > >>>>>> to
> > >>>>>>>>> be
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> > >>>>>>>> supposed
> > >>>>>>>>> to
> > >>>>>>>>>>>>> be
> > >>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
again,
> > >>> and
> > >>>>>>>> it
> > >>>>>>>>>>>>> loops
> > >>>>>>>>>>>>>> over and over.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -------------
> > >>>>>>>>>>>>>> Server config
> > >>>>>>>>>>>>>> -------------
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> > >>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> group.1:1:2:3:4:5
> > >>>>>>>>>>>>>> weight.1=1
> > >>>>>>>>>>>>>> weight.2=1
> > >>>>>>>>>>>>>> weight.3=1
> > >>>>>>>>>>>>>> weight.4=1
> > >>>>>>>>>>>>>> weight.5=1
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> group.2:6:7:8:9
> > >>>>>>>>>>>>>> weight.6=0
> > >>>>>>>>>>>>>> weight.7=0
> > >>>>>>>>>>>>>> weight.8=0
> > >>>>>>>>>>>>>> weight.9=0
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
> > >>>>>>>> different
> > >>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
> > >>>>>> machines
> > >>>>>>>>> in
> > >>>>>>>>>>>>> dc1
> > >>>>>>>>>>>>>> have voting rights, and the ability to become a
leader.
> > >>> The
> > >>>>>>>>>>>>> machines
> > >>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>> the pods all have a weight of zero, and are not
> expected
> > >>> to
> > >>>>>>>>>>> become
> > >>>>>>>>>>>>>> leaders, or to vote on transactions.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -Todd
> > >

RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick, thanks! I'll forward on to IT and I'll report back to you
shortly...

> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Tuesday, August 04, 2009 3:55 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Todd, Mahadev and I looked at this and it turns out to be a regression.
> Ironically a patch I created for 3.2 branch to add quorum tests actually
> broke the quorum config -- a default value for a config parameter was
> lost. I'm going to submit a patch asap to get the default back, but for
> the time being you can set:
> 
> electionAlg=3
> 
> in each of your config files.
> 
> You should see reference to FastLeaderElection in your log files if this
> parameter is set correctly.
> 
> Sorry for the trouble,
> 
> Patrick
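
The workaround quoted above has two checkable parts: `electionAlg=3` is present in each config file, and the logs mention FastLeaderElection. A minimal sketch of both checks follows. The helper names `election_alg` and `mentions_fle` and the sample config text are hypothetical, and the parsing assumes simple `key=value` lines with `#` comments.

```python
def election_alg(cfg_text):
    """Return the electionAlg value from zoo.cfg-style text, or None if unset.

    Assumption: plain key=value lines; if the key is repeated, the last wins.
    """
    value = None
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "electionAlg":
            value = val.strip()
    return value


def mentions_fle(log_text):
    """True if the log text refers to FastLeaderElection anywhere."""
    return "FastLeaderElection" in log_text


# Hypothetical config excerpt, for illustration only.
sample_cfg = """\
# excerpt
server.1=dc1-zook01.dc01.revsci.net:2888:3888
electionAlg=3
"""

print(election_alg(sample_cfg))
```

Running `election_alg` over the config text on every server is a quick way to spot a machine where the setting was missed before reading through the logs.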
> 
> Todd Greenwood wrote:
> > Mahadev,
> >
> > I just heard from IT that this build behaves in exactly the same way as
> > previous versions, e.g. we get continuous leader elections that
> > disconnect the followers and then get re-elected, and disconnect...etc.
> >
> > This is from a fresh sync to the 3.2 branch:
> >
> > svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
> >
> > CHANGES.txt shows the various fixes included:
> >
> > toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> > Release 3.2.1
> >
> > Backward compatibile changes:
> >
> > BUGFIXES:
> >   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
> >
> >   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
> >
> >   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> >
> >   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
> >
> >   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
> >   (giri via mahadev)
> >
> >   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> >
> >   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
> >   failure. (chris via mahadev)
> >
> >   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
> >
> >   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
> >   embedded clients (ryan rawson via phunt)
> >
> >   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
> >
> >   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
> >   (flavio via mahadev)
> >
> >   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
> >   (Chris Darroch via phunt)
> >
> >   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
> >   add vote of follower (flavio via mahadev)
> >
> >   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
> >   mahadev)
> >
> > What can I do to assist you with this issue?
> >
> > -Todd
> >
> >> -----Original Message-----
> >> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >> Sent: Tuesday, August 04, 2009 12:43 PM
> >> To: zookeeper-dev@hadoop.apache.org
> >> Subject: Re: Unending Leader Elections in WAN deploy
> >>
> >> Hi todd,
> >>  comments in line
> >>
> >>
> >> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
> > wrote:
> >>> Mahadev,
> >>>
> >>> Some quick questions:
> >>>
> >>> 1. Version
> >>>
> >>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is
> > still
> >>> calling this 3.2.0. Should this be rev'd, and am I correct in
> > calling
> >>> this release 3.2.1?
> >> Yes the release is 3.2.1. The build.xml will be fixed as soon as we
> > tag
> >> the
> >> release.
> >>
> >>> 2. Build targets
> >>>
> >>> The package target fails b/c the create-cppunit-configure target
> > fails
> >>> due to various problems w/ respect to autoconf. Are these
> > dependencies
> >>> documented somewhere ? I'd like to have a fully building system.
> >>>
> >>> create-cppunit-configure:
> >>>      [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
> >>>      [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
> >>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
> >>>      [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
> >>>      [exec]       If this token and others are legitimate, please use m4_pattern_allow.
> >>>      [exec]       See the Autoconf documentation.
> >>>      [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
> >>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >>>
> >> You need auto tools to run this. Please read the README for
building c
> >> client library at src/c/ for the installation requirements.
> >>> 3. Sync failure:
> >>>
> >>> This is still failing.
> >>>
> >>> svn: URL
> >>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> >>> doesn't exist
> >>>
> >> Yes this hasn't been fixed yet!
> >>
> >> Thanks
> >> mahadev
> >>> -Todd
> >>>
> >>>> -----Original Message-----
> >>>> From: Todd Greenwood
> >>>> Sent: Tuesday, August 04, 2009 11:26 AM
> >>>> To: 'zookeeper-user@hadoop.apache.org'
> >>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>
> >>>> Great news. Thank you Mahadev. I'll report our findings later
> > today.
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>>>> To: zookeeper-user@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Hi Todd,
> >>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
> > now.
> >>>>> Thanks
> >>>>> mahadev
> >>>>>
> >>>>>
> >>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
> >>> wrote:
> >>>>>> That'd be perfect. Thanks!
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Hi Todd,
> >>>>>>>   Most of the patches that you mention should be in the branch
> >>> 3.2 by
> >>>>>> tomm
> >>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
> > tomm.
> >>>>>> Would
> >>>>>>> that
> >>>>>>> suffice for you?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> mahadev
> >>>>>>>
> >>>>>>>
> >>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
<to...@audiencescience.com>
> >>>> wrote:
> >>>>>>>> Another problem...I've reverted to the latest versions of the
> >>>>>> patches
> >>>>>>>> that are not specific to branch-3.2, and I'm getting two
> >>> compilation
> >>>>>>>> errors:
> >>>>>>>>
> >>>>>>>> build-generated:
> >>>>>>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>
> >>>>>>>> compile-main:
> >>>>>>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
> >>>>>>>>     [javac]         public String[] getQuorumPeers();
> >>>>>>>>     [javac]                         ^
> >>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
> >>>>>>>>     [javac]         public String getServerState();
> >>>>>>>>     [javac]                       ^
> >>>>>>>>     [javac] 2 errors
> >>>>>>>>
> >>>>>>>> My build process is pretty simple:
> >>>>>>>>
> >>>>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>>>> (src/patched/branch-3.2)
> >>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>>>> 3. build zookeeper in the temp directory
> >>>>>>>>
> >>>>>>>> -Todd
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>
> >>>>>>>>> Flavio,
> >>>>>>>>> I notice that you've updated the patches referenced for the
> > WAN
> >>>>>>>>> deployment. There appears to be an order dependency w/
respect
> >>> to
> >>>>>>>> these
> >>>>>>>>> four patches...
> >>>>>>>>>
> >>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>>>>>>>>
> >>>>>>>>> 473 -> 479 (479 fails)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> >>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> >>>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>>>> Hunk #1 FAILED at 93.
> >>>>>>>>> Hunk #2 FAILED at 145.
> >>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> >>>>>>>>>
> >>>>>>>>> Could you advise as to which patches I need to apply, and in
> >>> what
> >>>>>>>> order?
> >>>>>>>>> -Todd
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>>>
> >>>>>>>>>> -Flavio
> >>>>>>>>>>
> >>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
> >>> Compilation
> >>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest
> >>> patches
> >>>>>>>>> 473,
> >>>>>>>>>>> 479, 481, and 491.
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of
> > the
> >>>>>>>> patch.
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Flavio,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> compile-main:
> >>>>>>>>>>>>>   [javac] Compiling 1 source file to
> >>>>>>>>>>>>>
> >>>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>>>>>>>>> src/p
> >>>>>>>>>>>>> atched/branch-3.2/build/classes
> >>>>>>>>>>>>>   [javac]
> >>>>>>>>>>>>>
> >>>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>>>>>>>>> src/p
> >>>>>>>>>>>>>
> >>>
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>>>>>>>>> FastL
> >>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
> >>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>>>>>>>>   [javac] location: interface
> >>>>>>>>>>>>>
org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>>>   [javac]
> >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>>   [javac]
> >>> ^
> >>>>>>>>>>>>>   [javac] 1 error
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I see a reference to getWeight in both
> >>> FastLeaderElection.java
> >>>>>>>> in
> >>>>>>>>>>>>> patch
> >>>>>>>>>>>>> 491:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
> >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>>>>>>>>> FastLeaderElection.java
> >>>>>>>>>>>>> :
> >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> >>>>>>>>>>>>> 0)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> However, I don't see a reference to this method in
patches
> >>> 473,
> >>>>>>>>> 479,
> >>>>>>>>>>>>> or
> >>>>>>>>>>>>> 481. I also don't see a reference to this method in the
> >>>>>> trunk...
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This repro's in both branch-3.2, and
> >>> branch-3.2+patches(473,
> >>>>>>>>> 479,
> >>>>>>>>>>>>>> 481).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Basically, it seems like the nodes are electing
> >>> pd4-zook02
> >>>>>> to
> >>>>>>>>> be
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> >>>>>>>> supposed
> >>>>>>>>> to
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
> >>> and
> >>>>>>>> it
> >>>>>>>>>>>>> loops
> >>>>>>>>>>>>>> over and over.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -------------
> >>>>>>>>>>>>>> Server config
> >>>>>>>>>>>>>> -------------
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
> >>>>>>>> different
> >>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
> >>>>>> machines
> >>>>>>>>> in
> >>>>>>>>>>>>> dc1
> >>>>>>>>>>>>>> have voting rights, and the ability to become a leader.
> >>> The
> >>>>>>>>>>>>> machines
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>> the pods all have a weight of zero, and are not
expected
> >>> to
> >>>>>>>>>>> become
> >>>>>>>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Todd
> >

Re: Unending Leader Elections in WAN deploy

Posted by Patrick Hunt <ph...@apache.org>.
Todd, Mahadev and I looked at this and it turns out to be a regression.
Ironically, a patch I created for the 3.2 branch to add quorum tests actually
broke the quorum config: a default value for a config parameter was
lost. I'm going to submit a patch ASAP to restore the default, but for
the time being you can set:

electionAlg=3

in each of your config files.

You should see references to FastLeaderElection in your log files if this
parameter is set correctly.
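
A minimal sketch of how both checks could be scripted, assuming each
config file is plain key=value text and the server log is available as
text. The helper names election_alg and mentions_fle are hypothetical
and not part of ZooKeeper:

```python
def election_alg(cfg_text):
    """Return the electionAlg value from zoo.cfg-style text, or None if unset."""
    value = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "electionAlg":
            value = val.strip()  # the last occurrence wins
    return value


def mentions_fle(log_text):
    """True if the log text references FastLeaderElection anywhere."""
    return "FastLeaderElection" in log_text


if __name__ == "__main__":
    sample_cfg = "tickTime=2000\nelectionAlg=3\n"
    print("electionAlg set to 3:", election_alg(sample_cfg) == "3")
```

Running election_alg over the text of each server's config file shows
which ones still lack the workaround.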

Sorry for the trouble,

Patrick

Todd Greenwood wrote:
> Mahadev,
> 
> I just heard from IT that this build behaves in exactly the same way as
> previous versions, e.g. we get continuous leader elections that
> disconnect the followers and then get re-elected, and disconnect...etc.
> 
> This is from a fresh sync to the 3.2 branch:
> 
> svn co
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
> ./branch-3.2
> 
> CHANGES.TXT show the various fixes included:
> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> Release 3.2.1
>
> Backward compatibile changes:
>
> BUGFIXES:
>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
>
>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
>
>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
>
>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
>
>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>   (giri via mahadev)
>
>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
>
>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
>   failure. (chris via mahadev)
>
>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
>
>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
>   embedded clients (ryan rawson via phunt)
>
>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
>
>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
>   (flavio via mahadev)
>
>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
>   (Chris Darroch via phunt)
>
>   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
>   add vote of follower (flavio via mahadev)
>
>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
>   mahadev)
> 
> What can I do to assist you with this issue?
> 
> -Todd
> 
>> -----Original Message-----
>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>> Sent: Tuesday, August 04, 2009 12:43 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: Re: Unending Leader Elections in WAN deploy
>>
>> Hi todd,
>>  comments in line
>>
>>
>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
> wrote:
>>> Mahadev,
>>>
>>> Some quick questions:
>>>
>>> 1. Version
>>>
>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is
> still
>>> calling this 3.2.0. Should this be rev'd, and am I correct in
> calling
>>> this release 3.2.1?
>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we
> tag
>> the
>> release.
>>
>>> 2. Build targets
>>>
>>> The package target fails b/c the create-cppunit-configure target
> fails
>>> due to various problems w/ respect to autoconf. Are these
> dependencies
>>> documented somewhere ? I'd like to have a fully building system.
>>>
>>> create-cppunit-configure:
>>>      [exec] Can't exec "libtoolize": No such file or directory at
>>> /usr/bin/autoreconf line 188.
>>>      [exec] Use of uninitialized value $libtoolize in pattern match
>>> (m//) at /usr/bin/autoreconf line 188.
>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not
> found
>>> in library
>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>> AM_PATH_CPPUNIT
>>>      [exec]       If this token and others are legitimate, please
> use
>>> m4_pattern_allow.
>>>      [exec]       See the Autoconf documentation.
>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>> AC_PROG_LIBTOOL
>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
>>>
>> You need auto tools to run this. Please read the README for building c
>> client library at src/c/ for the installation requirements.
>>> 3. Sync failure:
>>>
>>> This is still failing.
>>>
>>> svn: URL
>>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>> doesn't exist
>>>
>> Yes this hasn't been fixed yet!
>>
>> Thanks
>> mahadev
>>> -Todd
>>>
>>>> -----Original Message-----
>>>> From: Todd Greenwood
>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>
>>>> Great news. Thank you Mahadev. I'll report our findings later
> today.
>>>> -Todd
>>>>
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>
>>>>> Hi Todd,
>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
> now.
>>>>> Thanks
>>>>> mahadev
>>>>>
>>>>>
>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
>>> wrote:
>>>>>> That'd be perfect. Thanks!
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>
>>>>>>> Hi Todd,
>>>>>>>   Most of the patches that you mention should be in the branch
>>> 3.2 by
>>>>>> tomm
>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
> tomm.
>>>>>> Would
>>>>>>> that
>>>>>>> suffice for you?
>>>>>>>
>>>>>>> Thanks
>>>>>>> mahadev
>>>>>>>
>>>>>>>
>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>>>> Another problem...I've reverted to the latest versions of the
>>>>>> patches
>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>> compilation
>>>>>>>> errors:
>>>>>>>>
>>>>>>>> build-generated:
>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>
>>>>>>>> compile-main:
>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>     [javac]
>>>>>>>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>> atched/branch-
>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>> getQuorumPeers()
>>>>>> have
>>>>>>>> the same erasure
>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>     [javac]                         ^
>>>>>>>>     [javac]
>>>>>>>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>> atched/branch-
>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>> mStats.java:31: name clash: getServerState() and
>>> getServerState()
>>>>>> have
>>>>>>>> the same erasure
>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>     [javac]                       ^
>>>>>>>>     [javac] 2 errors
>>>>>>>>
>>>>>>>> My build process is pretty simple:
>>>>>>>>
>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>> (src/patched/branch-3.2)
>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>
>>>>>>>>> Flavio,
>>>>>>>>> I notice that you've updated the patches referenced for the
> WAN
>>>>>>>>> deployment. There appears to be an order dependency w/ respect
>>> to
>>>>>>>> these
>>>>>>>>> four patches...
>>>>>>>>>
>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>
>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>
>>>>>>>>>
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>> patching file
>>>>>>>>>
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>> ical.java
>>>>>>>>> patching file
>>>>>>>>>
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>> patching file
>>>>>>>>>
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>> .java
>>>>>>>>> patching file
>>>>>>>>>
>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>
>>>>>>>>> Could you advise as to which patches I need to apply, and in
>>> what
>>>>>>>> order?
>>>>>>>>> -Todd
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>
>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>
>>>>>>>>>> -Flavio
>>>>>>>>>>
>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>> Compilation
>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest
>>> patches
>>>>>>>>> 473,
>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>
>>>>>>>>>>> -Todd
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of
> the
>>>>>>>> patch.
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>> src/p
>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>> src/p
>>>>>>>>>>>>>
>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>> FastL
>>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>   [javac]
>>> ^
>>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see a reference to getWeight in both
>>> FastLeaderElection.java
>>>>>>>> in
>>>>>>>>>>>>> patch
>>>>>>>>>>>>> 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>>> :
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>>> 0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I don't see a reference to this method in patches
>>> 473,
>>>>>>>>> 479,
>>>>>>>>>>>>> or
>>>>>>>>>>>>> 481. I also don't see a reference to this method in the
>>>>>> trunk...
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This repro's in both branch-3.2, and
>>> branch-3.2+patches(473,
>>>>>>>>> 479,
>>>>>>>>>>>>>> 481).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>> pd4-zook02
>>>>>> to
>>>>>>>>> be
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>> supposed
>>>>>>>>> to
>>>>>>>>>>>>> be
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
>>> and
>>>>>>>> it
>>>>>>>>>>>>> loops
>>>>>>>>>>>>>> over and over.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>> Server config
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>> different
>>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>> machines
>>>>>>>>> in
>>>>>>>>>>>>> dc1
>>>>>>>>>>>>>> have voting rights, and the ability to become a leader.
>>> The
>>>>>>>>>>>>> machines
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the pods all have a weight of zero, and are not expected
>>> to
>>>>>>>>>>> become
>>>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Todd
> 
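
The group and weight settings in the server config quoted above can be
read as a small quorum rule. Below is a minimal Python sketch, assuming
the rule described in the ZooKeeper hierarchical quorum documentation: a
set of servers forms a quorum when it holds more than half of the weight
in a majority of the groups whose total weight is non-zero. The names
GROUPS, WEIGHTS and is_quorum are illustrative only and are not ZooKeeper
APIs.

```python
# Server ids per group, mirroring group.1 and group.2 in the quoted config.
GROUPS = {
    1: {1, 2, 3, 4, 5},   # dc1 servers
    2: {6, 7, 8, 9},      # pd1 and pd4 servers
}

# weight.N values from the quoted config: dc1 servers vote, pod servers do not.
WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}


def is_quorum(acks, groups=GROUPS, weights=WEIGHTS):
    """Return True if the server ids in `acks` form a hierarchical quorum."""
    voting_groups = 0
    groups_won = 0
    for members in groups.values():
        total = sum(weights[s] for s in members)
        if total == 0:
            # Assumption: zero-weight groups are not counted at all.
            continue
        voting_groups += 1
        got = sum(weights[s] for s in members if s in acks)
        if got > total / 2:
            groups_won += 1
    return groups_won > voting_groups / 2


if __name__ == "__main__":
    print(is_quorum({1, 2, 3}))
    print(is_quorum({6, 7, 8, 9}))
```

Under these assumptions, any three of the five dc1 servers form a quorum,
and the zero-weight pod servers never contribute to one.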

RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Will do.

> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Tuesday, August 04, 2009 1:34 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> It would be better to create a JIRA with configs as well as logs.
> 
> Patrick
> 
> Mahadev Konar wrote:
> > Hi Todd,
> >
> >   What is the synclimit you are using? Can you post your config? For
> WAN's
> > you will have to use much bigger values for synclimit and others.
> >
> > Thanks
> > mahadev
> >
> >
> > On 8/4/09 1:24 PM, "Todd Greenwood" <to...@audiencescience.com>
wrote:
> >
> >> Mahadev,
> >>
> >> I just heard from IT that this build behaves in exactly the same way as
> >> previous versions, e.g. we get continuous leader elections that
> >> disconnect the followers and then get re-elected, and disconnect...etc.
> >>
> >> This is from a fresh sync to the 3.2 branch:
> >>
> >> svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
> >>
> >> CHANGES.TXT show the various fixes included:
> >>
> >> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> >> Release 3.2.1
> >>
> >> Backward compatibile changes:
> >>
> >> BUGFIXES:
> >>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
> >>
> >>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
> >>
> >>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> >>
> >>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
> >>
> >>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
> >>   (giri via mahadev)
> >>
> >>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> >>
> >>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
> >>   failure. (chris via mahadev)
> >>
> >>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
> >>
> >>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
> >>   embedded clients (ryan rawson via phunt)
> >>
> >>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
> >>
> >>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
> >>   (flavio via mahadev)
> >>
> >>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
> >>   (Chris Darroch via phunt)
> >>
> >>   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
> >>   add vote of follower (flavio via mahadev)
> >>
> >>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
> >>   mahadev)
> >>
> >> What can I do to assist you with this issue?
> >>
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>> Sent: Tuesday, August 04, 2009 12:43 PM
> >>> To: zookeeper-dev@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Hi todd,
> >>>  comments in line
> >>>
> >>>
> >>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
> >> wrote:
> >>>> Mahadev,
> >>>>
> >>>> Some quick questions:
> >>>>
> >>>> 1. Version
> >>>>
> >>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> >>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
> >>>> this release 3.2.1?
> >>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
> >>> the release.
> >>>
> >>>> 2. Build targets
> >>>>
> >>>> The package target fails b/c the create-cppunit-configure target fails
> >>>> due to various problems w/ respect to autoconf. Are these dependencies
> >>>> documented somewhere? I'd like to have a fully building system.
> >>>>
> >>>> create-cppunit-configure:
> >>>>      [exec] Can't exec "libtoolize": No such file or directory at
> >>>> /usr/bin/autoreconf line 188.
> >>>>      [exec] Use of uninitialized value $libtoolize in pattern match
> >>>> (m//) at /usr/bin/autoreconf line 188.
> >>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found
> >>>> in library
> >>>>      [exec] configure.ac:33: error: possibly undefined macro:
> >>>> AM_PATH_CPPUNIT
> >>>>      [exec]       If this token and others are legitimate, please use
> >>>> m4_pattern_allow.
> >>>>      [exec]       See the Autoconf documentation.
> >>>>      [exec] configure.ac:53: error: possibly undefined macro:
> >>>> AC_PROG_LIBTOOL
> >>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >>>>
> >>> You need auto tools to run this. Please read the README for building c
> >>> client library at src/c/ for the installation requirements.
> >>>> 3. Sync failure:
> >>>>
> >>>> This is still failing.
> >>>>
> >>>> svn: URL
> >>>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> >>>> doesn't exist
> >>>>
> >>> Yes this hasn't been fixed yet!
> >>>
> >>> Thanks
> >>> mahadev
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Todd Greenwood
> >>>>> Sent: Tuesday, August 04, 2009 11:26 AM
> >>>>> To: 'zookeeper-user@hadoop.apache.org'
> >>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Great news. Thank you Mahadev. I'll report our findings later today.
> >>>>> -Todd
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>
> >>>>>> Hi Todd,
> >>>>>>  I just committed 480 and 491. You can check out the 3.2 branch now.
> >>>>>> Thanks
> >>>>>> mahadev
> >>>>>>
> >>>>>>
> >>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
> >>>> wrote:
> >>>>>>> That'd be perfect. Thanks!
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> Hi Todd,
> >>>>>>>>   Most of the patches that you mention should be in the 3.2 branch by
> >>>>>>>> tomorrow or so. 481 and 479 are already in; 480 and 491 should be in by
> >>>>>>>> tomorrow. Would that suffice for you?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> mahadev
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>>>>>>> Another problem...I've reverted to the latest versions of the patches
> >>>>>>>>> that are not specific to branch-3.2, and I'm getting two compilation
> >>>>>>>>> errors:
> >>>>>>>>>
> >>>>>>>>> build-generated:
> >>>>>>>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>
> >>>>>>>>> compile-main:
> >>>>>>>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
> >>>>>>>>>     [javac]         public String[] getQuorumPeers();
> >>>>>>>>>     [javac]                         ^
> >>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
> >>>>>>>>>     [javac]         public String getServerState();
> >>>>>>>>>     [javac]                       ^
> >>>>>>>>>     [javac] 2 errors
> >>>>>>>>>
> >>>>>>>>> My build process is pretty simple:
> >>>>>>>>>
> >>>>>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>>>>> (src/patched/branch-3.2)
> >>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>>>>> 3. build zookeeper in the temp directory
> >>>>>>>>>
> >>>>>>>>> -Todd
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> Flavio,
> >>>>>>>>>> I notice that you've updated the patches referenced for the WAN
> >>>>>>>>>> deployment. There appears to be an order dependency w/ respect to these
> >>>>>>>>>> four patches...
> >>>>>>>>>>
> >>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>>>>>>>>>
> >>>>>>>>>> 473 -> 479 (479 fails)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> >>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> >>>>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>>>>> Hunk #1 FAILED at 93.
> >>>>>>>>>> Hunk #2 FAILED at 145.
> >>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> >>>>>>>>>>
> >>>>>>>>>> Could you advise as to which patches I need to apply, and in what
> >>>>>>>>>> order?
> >>>>>>>>>> -Todd
> >>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>
> >>>>>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>>>>
> >>>>>>>>>>> -Flavio
> >>>>>>>>>>>
> >>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
> >>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches 473,
> >>>>>>>>>>>> 479, 481, and 491.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
> >>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Flavio,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> compile-main:
> >>>>>>>>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
> >>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>>>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>>   [javac] ^
> >>>>>>>>>>>>>   [javac] 1 error
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I see a reference to getWeight both in patch 491 and in the patched
> >>>>>>>>>>>>> FastLeaderElection.java:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479, or
> >>>>>>>>>>>>> 481. I also don't see a reference to this method in the trunk...
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This reproduces in both branch-3.2 and branch-3.2 + patches (473, 479,
> >>>>>>>>>>>>> 481).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
> >>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
> >>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again, and it loops
> >>>>>>>>>>>>> over and over.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------
> >>>>>>>>>>>>> Server config
> >>>>>>>>>>>>> -------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
> >>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> >>>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines in
> >>>>>>>>>>>>> the pods all have a weight of zero, and are not expected to become
> >>>>>>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >

Re: Unending Leader Elections in WAN deploy

Posted by Patrick Hunt <ph...@apache.org>.
It would be better to create a JIRA with configs as well as logs.

Patrick

Mahadev Konar wrote:
> Hi Todd,
> 
>   What syncLimit are you using? Can you post your config? For WANs
> you will have to use much larger values for syncLimit and the other limits.
> 
> Thanks
> mahadev
> 
> 
> On 8/4/09 1:24 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
>> Mahadev,
>>
>> I just heard from IT that this build behaves in exactly the same way as
>> previous versions, i.e. we get continuous leader elections: the elected
>> leader disconnects the followers, gets re-elected, disconnects them again,
>> and so on.
>>
>> This is from a fresh sync to the 3.2 branch:
>>
>> svn co
>> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
>> ./branch-3.2
>>
>> CHANGES.txt shows the various fixes included:
>>
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>> /src/original$ head -n 50 branch-3.2/CHANGES.txt
>> Release 3.2.1
>>
>> Backward compatibile changes:
>>
>> BUGFIXES:
>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via
>> flavio)
>>
>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via
>> mahadev)
>>
>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
>>
>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via
>> mahadev)
>>
>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>   (giri via mahadev)
>>   
>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
>>
>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
>>   failure. (chris via mahadev)
>>
>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via
>> phunt)
>>
>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>> other)
>>   embedded clients (ryan rawson via phunt)
>>
>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via
>> mahadev)
>>
>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
>>   (flavio via mahadev)
>>
>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty
>> cert
>>   (Chris Darroch via phunt)
>>
>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>> leading and
>>   add vote of follower (flavio via mahadev)
>>
>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio
>> via
>>   mahadev)
>>
>> What can I do to assist you with this issue?
>>
>> -Todd
>>
>>> -----Original Message-----
>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>
>>> Hi todd,
>>>  comments in line
>>>
>>>
>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>> wrote:
>>>> Mahadev,
>>>>
>>>> Some quick questions:
>>>>
>>>> 1. Version
>>>>
>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
>>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
>>>> this release 3.2.1?
>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
>>> the release.
>>>
>>>> 2. Build targets
>>>>
>>>> The package target fails b/c the create-cppunit-configure target fails
>>>> due to various problems w/ respect to autoconf. Are these dependencies
>>>> documented somewhere? I'd like to have a fully building system.
>>>>
>>>> create-cppunit-configure:
>>>>      [exec] Can't exec "libtoolize": No such file or directory at
>>>> /usr/bin/autoreconf line 188.
>>>>      [exec] Use of uninitialized value $libtoolize in pattern match
>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found
>>>> in library
>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>> AM_PATH_CPPUNIT
>>>>      [exec]       If this token and others are legitimate, please use
>>>> m4_pattern_allow.
>>>>      [exec]       See the Autoconf documentation.
>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>> AC_PROG_LIBTOOL
>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
>>>>
>>> You need auto tools to run this. Please read the README for building c
>>> client library at src/c/ for the installation requirements.
>>>> 3. Sync failure:
>>>>
>>>> This is still failing.
>>>>
>>>> svn: URL
>>>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>> doesn't exist
>>>>
>>> Yes this hasn't been fixed yet!
>>>
>>> Thanks
>>> mahadev
>>>> -Todd
>>>>
>>>>> -----Original Message-----
>>>>> From: Todd Greenwood
>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>
>>>>> Great news. Thank you Mahadev. I'll report our findings later today.
>>>>> -Todd
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>
>>>>>> Hi Todd,
>>>>>>  I just committed 480 and 491. You can check out the 3.2 branch now.
>>>>>> Thanks
>>>>>> mahadev
>>>>>>
>>>>>>
>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>>> That'd be perfect. Thanks!
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>
>>>>>>>> Hi Todd,
>>>>>>>>   Most of the patches that you mention should be in the 3.2 branch by
>>>>>>>> tomorrow or so. 481 and 479 are already in; 480 and 491 should be in by
>>>>>>>> tomorrow. Would that suffice for you?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> mahadev
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>>>>>>>> Another problem...I've reverted to the latest versions of the patches
>>>>>>>>> that are not specific to branch-3.2, and I'm getting two compilation
>>>>>>>>> errors:
>>>>>>>>>
>>>>>>>>> build-generated:
>>>>>>>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>>
>>>>>>>>> compile-main:
>>>>>>>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>     [javac]                         ^
>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>     [javac]                       ^
>>>>>>>>>     [javac] 2 errors
>>>>>>>>>
>>>>>>>>> My build process is pretty simple:
>>>>>>>>>
>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>
>>>>>>>>> -Todd
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>
>>>>>>>>>> Flavio,
>>>>>>>>>> I notice that you've updated the patches referenced for the WAN
>>>>>>>>>> deployment. There appears to be an order dependency w/ respect to these
>>>>>>>>>> four patches...
>>>>>>>>>>
>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>
>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
>>>>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>
>>>>>>>>>> Could you advise as to which patches I need to apply, and in what
>>>>>>>>>> order?
>>>>>>>>>> -Todd
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>
>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>
>>>>>>>>>>> -Flavio
>>>>>>>>>>>
>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches 473,
>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>>
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
>>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>   [javac] ^
>>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see a reference to getWeight both in patch 491 and in the patched
>>>>>>>>>>>>> FastLeaderElection.java:
>>>>>>>>>>>>>
>>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479, or
>>>>>>>>>>>>> 481. I also don't see a reference to this method in the trunk...
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> This reproduces in both branch-3.2 and branch-3.2 + patches (473, 479,
>>>>>>>>>>>>> 481).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again, and it loops
>>>>>>>>>>>>> over and over.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>> Server config
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>
>>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>>
>>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
>>>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines in
>>>>>>>>>>>>> the pods all have a weight of zero, and are not expected to become
>>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
> 

Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,

  What syncLimit are you using? Can you post your config? For WANs
you will have to use much larger values for syncLimit and the other limits.
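
As a rough illustration of the settings in question, a zoo.cfg fragment
for a WAN ensemble might look like the one below. The values are
assumptions chosen only to show the shape of the change; they are not
taken from Todd's deployment, and suitable numbers depend on the latency
measured between the sites. initLimit and syncLimit are counted in ticks,
so the effective timeouts are initLimit * tickTime and
syncLimit * tickTime milliseconds.

```properties
# zoo.cfg timing settings (illustrative values only, not a recommendation)

# Length of one tick in milliseconds.
tickTime=2000

# Ticks a follower is allowed for connecting and syncing to a leader.
initLimit=30

# Ticks a follower may lag behind the leader before it is dropped.
syncLimit=15
```

With these example numbers a follower would get 60 seconds to connect
and sync initially, and could fall 30 seconds behind before being
dropped.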

Thanks
mahadev


On 8/4/09 1:24 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Mahadev,
> 
> I just heard from IT that this build behaves in exactly the same way as
> previous versions, i.e. we get continuous leader elections: the elected
> leader disconnects the followers, gets re-elected, disconnects them again,
> and so on.
> 
> This is from a fresh sync to the 3.2 branch:
> 
> svn co
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
> ./branch-3.2
> 
> CHANGES.txt shows the various fixes included:
> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> /src/original$ head -n 50 branch-3.2/CHANGES.txt
> Release 3.2.1
> 
> Backward compatibile changes:
> 
> BUGFIXES:
>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via
> flavio)
> 
>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via
> mahadev)
> 
>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> 
>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via
> mahadev)
> 
>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>   (giri via mahadev)
>   
>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> 
>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
>   failure. (chris via mahadev)
> 
>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via
> phunt)
> 
>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
> other)
>   embedded clients (ryan rawson via phunt)
> 
>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via
> mahadev)
> 
>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
>   (flavio via mahadev)
> 
>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty
> cert
>   (Chris Darroch via phunt)
> 
>   ZOOKEEPER-480. FLE should perform leader check when node is not
> leading and
>   add vote of follower (flavio via mahadev)
> 
>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio
> via
>   mahadev)
> 
> What can I do to assist you with this issue?
> 
> -Todd
> 
>> -----Original Message-----
>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>> Sent: Tuesday, August 04, 2009 12:43 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: Re: Unending Leader Elections in WAN deploy
>> 
>> Hi todd,
>>  comments in line
>> 
>> 
>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>> 
>>> Mahadev,
>>> 
>>> Some quick questions:
>>> 
>>> 1. Version
>>> 
>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
>>> this release 3.2.1?
>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag the
>> release.
>> 
>>> 
>>> 2. Build targets
>>> 
>>> The package target fails b/c the create-cppunit-configure target fails
>>> due to various problems w/ respect to autoconf. Are these dependencies
>>> documented somewhere ? I'd like to have a fully building system.
>>> 
>>> create-cppunit-configure:
>>>      [exec] Can't exec "libtoolize": No such file or directory at
>>> /usr/bin/autoreconf line 188.
>>>      [exec] Use of uninitialized value $libtoolize in pattern match
>>> (m//) at /usr/bin/autoreconf line 188.
>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found
>>> in library
>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>> AM_PATH_CPPUNIT
>>>      [exec]       If this token and others are legitimate, please use
>>> m4_pattern_allow.
>>>      [exec]       See the Autoconf documentation.
>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>> AC_PROG_LIBTOOL
>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
>>> 
>> You need auto tools to run this. Please read the README for building c
>> client library at src/c/ for the installation requirements.
>>> 
>>> 3. Sync failure:
>>> 
>>> This is still failing.
>>> 
>>> svn: URL
>>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>> doesn't exist
>>> 
>> 
>> Yes this hasn't been fixed yet!
>> 
>> Thanks
>> mahadev
>>> -Todd
>>> 
>>>> -----Original Message-----
>>>> From: Todd Greenwood
>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>> 
>>>> Great news. Thank you Mahadev. I'll report our findings later today.
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Hi Todd,
>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch now.
>>>>> 
>>>>> Thanks
>>>>> mahadev
>>>>> 
>>>>> 
>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
>>> wrote:
>>>>> 
>>>>>> That'd be perfect. Thanks!
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Hi Todd,
>>>>>>>   Most of the patches that you mention should be in the branch
>>> 3.2 by
>>>>>> tomm
>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
> tomm.
>>>>>> Would
>>>>>>> that
>>>>>>> suffice for you?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> mahadev
>>>>>>> 
>>>>>>> 
>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>>> 
>>>>>>>> Another problem...I've reverted to the latest versions of the
>>>>>> patches
>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>> compilation
>>>>>>>> errors:
>>>>>>>> 
>>>>>>>> build-generated:
>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>> 
>>>>>>>> compile-main:
>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>     [javac]
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> 
>>>>>> atched/branch-
>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>> getQuorumPeers()
>>>>>> have
>>>>>>>> the same erasure
>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>     [javac]                         ^
>>>>>>>>     [javac]
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> 
>>>>>> atched/branch-
>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>> mStats.java:31: name clash: getServerState() and
>>> getServerState()
>>>>>> have
>>>>>>>> the same erasure
>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>     [javac]                       ^
>>>>>>>>     [javac] 2 errors
>>>>>>>> 
>>>>>>>> My build process is pretty simple:
>>>>>>>> 
>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>> (src/patched/branch-3.2)
>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>> 
>>>>>>>> -Todd
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>> 
>>>>>>>>> Flavio,
>>>>>>>>> I notice that you've updated the patches referenced for the
> WAN
>>>>>>>>> deployment. There appears to be an order dependency w/ respect
>>> to
>>>>>>>> these
>>>>>>>>> four patches...
>>>>>>>>> 
>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>> 
>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>> patching file
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>> ical.java
>>>>>>>>> patching file
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>> patching file
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>> .java
>>>>>>>>> patching file
>>>>>>>>> 
>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>> 
>>>>>>>>> Could you advise as to which patches I need to apply, and in
>>> what
>>>>>>>> order?
>>>>>>>>> 
>>>>>>>>> -Todd
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>> 
>>>>>>>>>> -Flavio
>>>>>>>>>> 
>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>> Compilation
>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest
>>> patches
>>>>>>>>> 473,
>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of
> the
>>>>>>>> patch.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>> 
>>>>>>>>> 
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> 
>>>>>>>>> 
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> 
>>>>>>>>> 
>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastL
>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>   [javac]
>>> ^
>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>> 
>>>>>>>>>>>> I see a reference to getWeight in both
>>> FastLeaderElection.java
>>>>>>>> in
>>>>>>>>>>>> patch
>>>>>>>>>>>> 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>> :
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>> 0)
>>>>>>>>>>>> 
>>>>>>>>>>>> However, I don't see a reference to this method in patches
>>> 473,
>>>>>>>>> 479,
>>>>>>>>>>>> or
>>>>>>>>>>>> 481. I also don't see a reference to this method in the
>>>>>> trunk...
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This repro's in both branch-3.2, and
>>> branch-3.2+patches(473,
>>>>>>>>> 479,
>>>>>>>>>>>> 481).
>>>>>>>>>>>> 
>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>> pd4-zook02
>>>>>> to
>>>>>>>>> be
>>>>>>>>>>>> the
>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>> supposed
>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> and
>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
>>> and
>>>>>>>> it
>>>>>>>>>>>> loops
>>>>>>>>>>>> over and over.
>>>>>>>>>>>> 
>>>>>>>>>>>> -------------
>>>>>>>>>>>> Server config
>>>>>>>>>>>> -------------
>>>>>>>>>>>> 
>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>> 
>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>> 
>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>> different
>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>> machines
>>>>>>>>> in
>>>>>>>>>>>> dc1
>>>>>>>>>>>> have voting rights, and the ability to become a leader.
>>> The
>>>>>>>>>>>> machines
>>>>>>>>>>>> in
>>>>>>>>>>>> the pods all have a weight of zero, and are not expected
>>> to
>>>>>>>>>>> become
>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>> 
> 
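
For reference, the three-step build process quoted above (copy branch-3.2 to a temp directory, apply the patches, build) could be scripted roughly as below. This is a minimal sketch under stated assumptions, not something posted in the thread: the function names `plan_patched_build` and `run_plan` are hypothetical, the default build command `ant test` is borrowed from elsewhere on this list, and dry-running each patch before applying it is a design choice added here so that a patch with failing hunks (like the 479 attempt above) is caught before it is partially applied.

```python
import subprocess


def plan_patched_build(src, dst, patches, build_cmd=("ant", "test")):
    """Return (cwd, argv) pairs: copy the tree, apply patches in order, build.

    Hypothetical helper for illustration; it is not part of ZooKeeper's build.
    Patch paths should be absolute, because patch runs with cwd=dst.
    """
    plan = [(None, ["cp", "-r", str(src), str(dst)])]
    for patch in patches:
        # Dry run first: if any hunk fails, stop before this patch
        # modifies some files and leaves .rej files for others.
        plan.append((str(dst), ["patch", "-p0", "--dry-run", "-i", str(patch)]))
        plan.append((str(dst), ["patch", "-p0", "-i", str(patch)]))
    plan.append((str(dst), list(build_cmd)))
    return plan


def run_plan(plan):
    """Execute the plan in order, stopping at the first failing command."""
    for cwd, argv in plan:
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_plan(plan_patched_build(src, dst, patches))` with the patch files in the desired order would then stop at the first patch that does not apply cleanly. The thread does not confirm a required order for 473, 479, 481, and 491, so the order passed in remains the caller's choice.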


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Mahadev,

I just heard from IT that this build behaves in exactly the same way as
previous versions, i.e. we get continuous leader elections: the leader
disconnects the followers, gets re-elected, disconnects them again, etc.

This is from a fresh sync to the 3.2 branch:

svn co
http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
./branch-3.2

CHANGES.txt shows the various fixes included:

toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
/src/original$ head -n 50 branch-3.2/CHANGES.txt
Release 3.2.1

Backward compatibile changes:

BUGFIXES:
  ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via
flavio)

  ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via
mahadev)

  ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)

  ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via
mahadev)

  ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
  (giri via mahadev)
  
  ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)

  ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
  failure. (chris via mahadev) 

  ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via
phunt)

  ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
other)
  embedded clients (ryan rawson via phunt)

  ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via
mahadev)

  ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
  (flavio via mahadev)

  ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty
cert
  (Chris Darroch via phunt)

  ZOOKEEPER-480. FLE should perform leader check when node is not
leading and
  add vote of follower (flavio via mahadev)

  ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio
via
  mahadev)

What can I do to assist you with this issue?

-Todd

> -----Original Message-----
> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> Sent: Tuesday, August 04, 2009 12:43 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Hi todd,
>  comments in line
> 
> 
> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
> > Mahadev,
> >
> > Some quick questions:
> >
> > 1. Version
> >
> > I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> > calling this 3.2.0. Should this be rev'd, and am I correct in calling
> > this release 3.2.1?
> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag the
> release.
> 
> >
> > 2. Build targets
> >
> > The package target fails b/c the create-cppunit-configure target fails
> > due to various problems w/ respect to autoconf. Are these dependencies
> > documented somewhere ? I'd like to have a fully building system.
> >
> > create-cppunit-configure:
> >      [exec] Can't exec "libtoolize": No such file or directory at
> > /usr/bin/autoreconf line 188.
> >      [exec] Use of uninitialized value $libtoolize in pattern match
> > (m//) at /usr/bin/autoreconf line 188.
> >      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found
> > in library
> >      [exec] configure.ac:33: error: possibly undefined macro:
> > AM_PATH_CPPUNIT
> >      [exec]       If this token and others are legitimate, please use
> > m4_pattern_allow.
> >      [exec]       See the Autoconf documentation.
> >      [exec] configure.ac:53: error: possibly undefined macro:
> > AC_PROG_LIBTOOL
> >      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >
> You need auto tools to run this. Please read the README for building c
> client library at src/c/ for the installation requirements.
> >
> > 3. Sync failure:
> >
> > This is still failing.
> >
> > svn: URL
> > 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> > doesn't exist
> >
> 
> Yes this hasn't been fixed yet!
> 
> Thanks
> mahadev
> > -Todd
> >
> >> -----Original Message-----
> >> From: Todd Greenwood
> >> Sent: Tuesday, August 04, 2009 11:26 AM
> >> To: 'zookeeper-user@hadoop.apache.org'
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Great news. Thank you Mahadev. I'll report our findings later today.
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>> To: zookeeper-user@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Hi Todd,
> >>>  I just committed 480 and 491. You can checkout the 3.2 branch now.
> >>>
> >>> Thanks
> >>> mahadev
> >>>
> >>>
> >>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>
> >>>> That'd be perfect. Thanks!
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>> To: zookeeper-user@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Hi Todd,
> >>>>>   Most of the patches that you mention should be in the branch 3.2 by tomm
> >>>>> or so. 481, 479 are already in. 480 and 491 should be in by tomm. Would that
> >>>>> suffice for you?
> >>>>>
> >>>>> Thanks
> >>>>> mahadev
> >>>>>
> >>>>>
> >>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com>
> >> wrote:
> >>>>>
> >>>>>> Another problem...I've reverted to the latest versions of the
> >>>> patches
> >>>>>> that are not specific to branch-3.2, and I'm getting two
> > compilation
> >>>>>> errors:
> >>>>>>
> >>>>>> build-generated:
> >>>>>>     [javac] Compiling 44 source files to
> >>>>>>
> >>>>
> >>
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>> atched/branch-3.2/build/classes
> >>>>>>
> >>>>>> compile-main:
> >>>>>>     [javac] Compiling 2 source files to
> >>>>>>
> >>>>
> >>
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>> atched/branch-3.2/build/classes
> >>>>>>     [javac]
> >>>>>>
> >>>>
> >>
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>>
> >>>> atched/branch-
> >> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> >>>>>> mStats.java:30: name clash: getQuorumPeers() and
> > getQuorumPeers()
> >>>> have
> >>>>>> the same erasure
> >>>>>>     [javac]         public String[] getQuorumPeers();
> >>>>>>     [javac]                         ^
> >>>>>>     [javac]
> >>>>>>
> >>>>
> >>
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>>
> >>>> atched/branch-
> >> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> >>>>>> mStats.java:31: name clash: getServerState() and
> > getServerState()
> >>>> have
> >>>>>> the same erasure
> >>>>>>     [javac]         public String getServerState();
> >>>>>>     [javac]                       ^
> >>>>>>     [javac] 2 errors
> >>>>>>
> >>>>>> My build process is pretty simple:
> >>>>>>
> >>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>> (src/patched/branch-3.2)
> >>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>> 3. build zookeeper in the temp directory
> >>>>>>
> >>>>>> -Todd
> >>>>>>> -----Original Message-----
> >>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Flavio,
> >>>>>>> I notice that you've updated the patches referenced for the
WAN
> >>>>>>> deployment. There appears to be an order dependency w/ respect
> > to
> >>>>>> these
> >>>>>>> four patches...
> >>>>>>>
> >>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>>>>>>
> >>>>>>> 473 -> 479 (479 fails)
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> >>>>>>> /src/patched/branch-3.2$ patch -p0 <
> >>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>> patching file
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
> >>>>>>> ical.java
> >>>>>>> patching file
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>> patching file
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
> >>>>>>> .java
> >>>>>>> patching file
> >>>>>>>
> > src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>> Hunk #1 FAILED at 93.
> >>>>>>> Hunk #2 FAILED at 145.
> >>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> >>>>>>> /src/patched/branch-3.2$ h ../patches/
> >>>>>>>
> >>>>>>> Could you advise as to which patches I need to apply, and in
> > what
> >>>>>> order?
> >>>>>>>
> >>>>>>> -Todd
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>
> >>>>>>>>> Thanks. You were right, I had a stale version of 479.
> > Compilation
> >>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest
> > patches
> >>>>>>> 473,
> >>>>>>>>> 479, 481, and 491.
> >>>>>>>>>
> >>>>>>>>> -Todd
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> It should be in 479. Perhaps you have a stale version of
the
> >>>>>> patch.
> >>>>>>>>>>
> >>>>>>>>>> -Flavio
> >>>>>>>>>>
> >>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Flavio,
> >>>>>>>>>>>
> >>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>
> >>>>>>>>>>> compile-main:
> >>>>>>>>>>>   [javac] Compiling 1 source file to
> >>>>>>>>>>>
> >>>>>>>
> > /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>>>>>>> src/p
> >>>>>>>>>>> atched/branch-3.2/build/classes
> >>>>>>>>>>>   [javac]
> >>>>>>>>>>>
> >>>>>>>
> > /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>>>>>>> src/p
> >>>>>>>>>>>
> >>>>>>>
> > atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>>>>>>> FastL
> >>>>>>>>>>> eaderElection.java:601: cannot find symbol
> >>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>>>>>>   [javac] location: interface
> >>>>>>>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>   [javac]
> >>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>   [javac]
> > ^
> >>>>>>>>>>>   [javac] 1 error
> >>>>>>>>>>>
> >>>>>>>>>>> I see a reference to getWeight in both
> > FastLeaderElection.java
> >>>>>> in
> >>>>>>>>>>> patch
> >>>>>>>>>>> 491:
> >>>>>>>>>>>
> >>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
> >>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>>>>>>> FastLeaderElection.java
> >>>>>>>>>>> :
> >>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> >>>>>>>>>>> 0)
> >>>>>>>>>>>
> >>>>>>>>>>> However, I don't see a reference to this method in patches
> > 473,
> >>>>>>> 479,
> >>>>>>>>>>> or
> >>>>>>>>>>> 481. I also don't see a reference to this method in the
> >>>> trunk...
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> This repro's in both branch-3.2, and
> > branch-3.2+patches(473,
> >>>>>>> 479,
> >>>>>>>>>>>> 481).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Basically, it seems like the nodes are electing
> > pd4-zook02
> >>>> to
> >>>>>>> be
> >>>>>>>>>>> the
> >>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> >>>>>> supposed
> >>>>>>> to
> >>>>>>>>>>> be
> >>>>>>>>>>>> and
> >>>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
> > and
> >>>>>> it
> >>>>>>>>>>> loops
> >>>>>>>>>>>> over and over.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>> Server config
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>>
> >>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
> >>>>>> different
> >>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
> >>>> machines
> >>>>>>> in
> >>>>>>>>>>> dc1
> >>>>>>>>>>>> have voting rights, and the ability to become a leader.
> > The
> >>>>>>>>>>> machines
> >>>>>>>>>>>> in
> >>>>>>>>>>>> the pods all have a weight of zero, and are not expected
> > to
> >>>>>>>>> become
> >>>>>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> >>>>
> >


Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
 comments inline


On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Mahadev,
> 
> Some quick questions:
> 
> 1. Version
> 
> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> calling this 3.2.0. Should this be rev'd, and am I correct in calling
> this release 3.2.1?
Yes, the release is 3.2.1. The build.xml will be fixed as soon as we tag the
release.

> 
> 2. Build targets
> 
> The package target fails b/c the create-cppunit-configure target fails
> due to various problems w/ respect to autoconf. Are these dependencies
> documented somewhere ? I'd like to have a fully building system.
> 
> create-cppunit-configure:
>      [exec] Can't exec "libtoolize": No such file or directory at
> /usr/bin/autoreconf line 188.
>      [exec] Use of uninitialized value $libtoolize in pattern match
> (m//) at /usr/bin/autoreconf line 188.
>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found
> in library
>      [exec] configure.ac:33: error: possibly undefined macro:
> AM_PATH_CPPUNIT
>      [exec]       If this token and others are legitimate, please use
> m4_pattern_allow.
>      [exec]       See the Autoconf documentation.
>      [exec] configure.ac:53: error: possibly undefined macro:
> AC_PROG_LIBTOOL
>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> 
You need the autotools to run this. Please read the README for building the C
client library at src/c/ for the installation requirements.
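
As a quick pre-flight check for the errors quoted above, the sketch below reports which autotools programs are missing from PATH. The tool list is an assumption: autoreconf and libtoolize appear in the log, and autoconf, automake, and aclocal are the programs autoreconf usually drives. It is not taken from the src/c README, which remains the authoritative list. The helper name `missing_tools` is hypothetical. Note that it cannot detect a missing m4 macro; the AM_PATH_CPPUNIT error likely means the cppunit development files are not installed.

```python
import shutil

# Assumed tool list: autoreconf and libtoolize appear in the log above;
# autoconf, automake and aclocal are what autoreconf usually invokes.
AUTOTOOLS = ("autoreconf", "autoconf", "automake", "aclocal", "libtoolize")


def missing_tools(tools=AUTOTOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]
```

An empty list from `missing_tools()` means all five programs were found; on the machine in the log above, `libtoolize` at least would be expected in the result.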
> 
> 3. Sync failure:
> 
> This is still failing.
> 
> svn: URL
> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> doesn't exist
> 

Yes, this hasn't been fixed yet!

Thanks
mahadev
> -Todd
> 
>> -----Original Message-----
>> From: Todd Greenwood
>> Sent: Tuesday, August 04, 2009 11:26 AM
>> To: 'zookeeper-user@hadoop.apache.org'
>> Subject: RE: Unending Leader Elections in WAN deploy
>> 
>> Great news. Thank you Mahadev. I'll report our findings later today.
>> -Todd
>> 
>>> -----Original Message-----
>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>> 
>>> Hi Todd,
>>>  I just committed 480 and 491. You can checkout the 3.2 branch now.
>>> 
>>> Thanks
>>> mahadev
>>> 
>>> 
>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>> 
>>>> That'd be perfect. Thanks!
>>>> 
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Hi Todd,
>>>>>   Most of the patches that you mention should be in the branch 3.2 by tomm
>>>>> or so. 481, 479 are already in. 480 and 491 should be in by tomm. Would that
>>>>> suffice for you?
>>>>> 
>>>>> Thanks
>>>>> mahadev
>>>>> 
>>>>> 
>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>>>> 
>>>>>> Another problem...I've reverted to the latest versions of the patches
>>>>>> that are not specific to branch-3.2, and I'm getting two compilation
>>>>>> errors:
>>>>>> 
>>>>>> build-generated:
>>>>>>     [javac] Compiling 44 source files to
>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>> 
>>>>>> compile-main:
>>>>>>     [javac] Compiling 2 source files to
>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>     [javac]
>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30:
>>>>>> name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>     [javac]                         ^
>>>>>>     [javac]
>>>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31:
>>>>>> name clash: getServerState() and getServerState() have the same erasure
>>>>>>     [javac]         public String getServerState();
>>>>>>     [javac]                       ^
>>>>>>     [javac] 2 errors
>>>>>> 
>>>>>> My build process is pretty simple:
>>>>>> 
>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>> (src/patched/branch-3.2)
>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>> 3. build zookeeper in the temp directory
>>>>>> 
>>>>>> -Todd
>>>>>>> -----Original Message-----
>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Flavio,
>>>>>>> I notice that you've updated the patches referenced for the WAN
>>>>>>> deployment. There appears to be an order dependency w/ respect to
>>>>>>> these four patches...
>>>>>>> 
>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>> 
>>>>>>> 473 -> 479 (479 fails)
>>>>>>> 
>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>> Hunk #1 FAILED at 93.
>>>>>>> Hunk #2 FAILED at 145.
>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
>>>>>>> 
>>>>>>> Could you advise as to which patches I need to apply, and in what
>>>>>>> order?
>>>>>>> 
>>>>>>> -Todd
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>> 
>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>> 
>>>>>>>> -Flavio
>>>>>>>> 
>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>> 
>>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches
>>>>>>>>> 473, 479, 481, and 491.
>>>>>>>>> 
>>>>>>>>> -Todd
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
>>>>>>>>>> 
>>>>>>>>>> -Flavio
>>>>>>>>>> 
>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>> 
>>>>>>>>>>> Flavio,
>>>>>>>>>>> 
>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>> 
>>>>>>>>>>> compile-main:
>>>>>>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>   [javac]                            ^
>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>> 
>>>>>>>>>>> I see a reference to getWeight both in FastLeaderElection.java and
>>>>>>>>>>> in patch 491:
>>>>>>>>>>> 
>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>> 
>>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479,
>>>>>>>>>>> or 481. I also don't see a reference to this method in the trunk...
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>>>>>>>>>> 481).
>>>>>>>>>>>> 
>>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be
>>>>>>>>>>>> the leader. However, pd4-zook02 seems to realize it's not supposed
>>>>>>>>>>>> to be and then disconnects everyone. Then they re-elect it again,
>>>>>>>>>>>> and it loops over and over.
>>>>>>>>>>>> 
>>>>>>>>>>>> -------------
>>>>>>>>>>>> Server config
>>>>>>>>>>>> -------------
>>>>>>>>>>>> 
>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>> 
>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>> 
>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in
>>>>>>>>>>>> dc1 have voting rights, and the ability to become a leader. The
>>>>>>>>>>>> machines in the pods all have a weight of zero, and are not
>>>>>>>>>>>> expected to become leaders, or to vote on transactions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>> 
> 


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Mahadev,

Some quick questions:

1. Version

CHANGES.txt calls this 3.2.1, but build.xml still says 3.2.0. Should
build.xml be bumped, and am I correct in calling this release 3.2.1?

2. Build targets

The package target fails because the create-cppunit-configure target
fails with the autoconf errors below. Are the native build dependencies
documented somewhere? I'd like to have a fully building system.

create-cppunit-configure:
     [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
     [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
     [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
     [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
     [exec]       If this token and others are legitimate, please use m4_pattern_allow.
     [exec]       See the Autoconf documentation.
     [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
     [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1


3. Sync failure

This is still failing.

svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
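
Regarding (2): judging only from the error output, my guess is that the
box is missing libtool (which provides libtoolize and the AC_PROG_LIBTOOL
macro) and the CppUnit development files (whose cppunit.m4 normally
defines AM_PATH_CPPUNIT). A minimal sketch of a prerequisite check is
below. The tool list is an assumption drawn from the log, not a documented
ZooKeeper requirement, and missing_tools is a made-up helper name.

```shell
#!/bin/sh
# Print each named command that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools the failing target appears to need (assumed from the errors above).
missing=$(missing_tools autoreconf autoconf automake libtoolize)
if [ -n "$missing" ]; then
    echo "not on PATH:" $missing
fi

# AM_PATH_CPPUNIT is normally defined in cppunit.m4, which the CppUnit
# development files install into aclocal's macro directory.
if command -v aclocal >/dev/null 2>&1; then
    m4dir=$(aclocal --print-ac-dir)
    [ -f "$m4dir/cppunit.m4" ] || echo "cppunit.m4 not found in $m4dir"
fi
```

If this reports anything, installing libtool and the CppUnit development
package (package names vary by distribution) would be the first thing I'd
try before rerunning ant package.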

-Todd



RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Great news. Thank you Mahadev. I'll report our findings later today.
-Todd



Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd, 
 I just committed 480 and 491. You can check out the 3.2 branch now.

Thanks
mahadev


On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> That'd be perfect. Thanks!
> 
>> -----Original Message-----
>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>> Sent: Monday, August 03, 2009 4:24 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: Re: Unending Leader Elections in WAN deploy
>> 
>> Hi Todd,
>>   Most of the patches that you mention should be in the 3.2 branch by
>> tomorrow or so. 481 and 479 are already in. 480 and 491 should be in
>> by tomorrow. Would that suffice for you?
>> 
>> Thanks
>> mahadev
>> 
>> 
>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>> 
>>> Another problem...I've reverted to the latest versions of the patches
>>> that are not specific to branch-3.2, and I'm getting two compilation
>>> errors:
>>> 
>>> build-generated:
>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>> 
>>> compile-main:
>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
>>>     [javac]         public String[] getQuorumPeers();
>>>     [javac]                         ^
>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
>>>     [javac]         public String getServerState();
>>>     [javac]                       ^
>>>     [javac] 2 errors
>>> 
>>> My build process is pretty simple:
>>> 
>>> 1. copy the branch-3.2 source to a temp directory
>>> (src/patched/branch-3.2)
>>> 2. apply the ZOOKEEPER patches in my patches directory
>>> 3. build zookeeper in the temp directory
>>> 
>>> -Todd
>>>> -----Original Message-----
>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>> To: zookeeper-user@hadoop.apache.org
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>> 
>>>> Flavio,
>>>> I notice that you've updated the patches referenced for the WAN
>>>> deployment. There appears to be an order dependency w/ respect to these
>>>> four patches...
>>>>
>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>
>>>> 473 -> 479 (479 fails)
>>>>
>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>> Hunk #1 FAILED at 93.
>>>> Hunk #2 FAILED at 145.
>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
>>>>
>>>> Could you advise as to which patches I need to apply, and in what order?
>>>> 
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Perfect! Thanks for the update, Todd.
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>> 
>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches 473,
>>>>>> 479, 481, and 491.
>>>>>> 
>>>>>> -Todd
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
>>>>>>> 
>>>>>>> -Flavio
>>>>>>> 
>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>> 
>>>>>>>> Flavio,
>>>>>>>> 
>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>
>>>>>>>> compile-main:
>>>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>   [javac]                                                    ^
>>>>>>>>   [javac] 1 error
>>>>>>>>
>>>>>>>> I see a reference to getWeight both in patch 491 and in
>>>>>>>> FastLeaderElection.java:
>>>>>>>>
>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>
>>>>>>>> However, I don't see a reference to this method in patches 473, 479, or
>>>>>>>> 481. I also don't see a reference to this method in the trunk...
>>>>>>>> 
>>>>>>>> -Todd
>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>> 
>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>> -Todd
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>> 
>>>>>>>>>> -Flavio
>>>>>>>>>> 
>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>> 
>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>>>>>>>>> 481).
>>>>>>>>>>>
>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
>>>>>>>>>>> then disconnects everyone. Then they re-elect it again, and it loops
>>>>>>>>>>> over and over.
>>>>>>>>>>> 
>>>>>>>>>>> -------------
>>>>>>>>>>> Server config
>>>>>>>>>>> -------------
>>>>>>>>>>> 
>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>> 
>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>> weight.1=1
>>>>>>>>>>> weight.2=1
>>>>>>>>>>> weight.3=1
>>>>>>>>>>> weight.4=1
>>>>>>>>>>> weight.5=1
>>>>>>>>>>> 
>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>> weight.6=0
>>>>>>>>>>> weight.7=0
>>>>>>>>>>> weight.8=0
>>>>>>>>>>> weight.9=0
>>>>>>>>>>> 
>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines in
>>>>>>>>>>> the pods all have a weight of zero, and are not expected to become
>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>> 
>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>> 
>>>>>> 
>>> 
> 
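
For anyone reading the group/weight config quoted above, here is a minimal Python sketch of what those settings are meant to express. It assumes the rule as I understand it from the ZooKeeper hierarchical-quorum documentation (a quorum needs a majority of the weight in a majority of the non-zero-weight groups). It is not the QuorumHierarchical code, and the names GROUPS, WEIGHTS, electable and is_quorum are made up for illustration. The electable helper mirrors the getWeight(n.sid) != 0 test quoted from patch 491.

```python
# Server ids only, mirroring the quoted config: group 1 is dc1, group 2 is the pods.
GROUPS = {1: [1, 2, 3, 4, 5], 2: [6, 7, 8, 9]}
WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}


def electable(weights):
    """Server ids with non-zero weight (hypothetical helper)."""
    return sorted(sid for sid, w in weights.items() if w != 0)


def is_quorum(acks, groups, weights):
    """True if `acks` holds a weight majority in a majority of the
    non-zero-weight groups (assumed rule, see the note above)."""
    totals = {g: sum(weights[s] for s in members) for g, members in groups.items()}
    live = {g: t for g, t in totals.items() if t > 0}  # zero-weight groups are ignored
    won = 0
    for g, total in live.items():
        got = sum(weights[s] for s in groups[g] if s in acks)
        if got > total / 2:
            won += 1
    return won > len(live) / 2


if __name__ == "__main__":
    print(electable(WEIGHTS))
    print(is_quorum({1, 2, 3}, GROUPS, WEIGHTS))
    print(is_quorum({6, 7, 8, 9}, GROUPS, WEIGHTS))
```

Under these assumptions only servers 1 through 5 count, any three of them form a quorum, and the pod servers cannot form one by themselves.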


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
That'd be perfect. Thanks!

> -----Original Message-----
> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> Sent: Monday, August 03, 2009 4:24 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Hi Todd,
>   Most of the patches that you mention should be in the branch 3.2 by
> tomorrow or so. 481, 479 are already in. 480 and 491 should be in by
> tomorrow. Would that suffice for you?
> 
> Thanks
> mahadev
> 
> 
> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
> > Another problem...I've reverted to the latest versions of the patches
> > that are not specific to branch-3.2, and I'm getting two compilation
> > errors:
> >
> > build-generated:
> >     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >
> > compile-main:
> >     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
> >     [javac]         public String[] getQuorumPeers();
> >     [javac]                         ^
> >     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
> >     [javac]         public String getServerState();
> >     [javac]                       ^
> >     [javac] 2 errors
> >
> > My build process is pretty simple:
> >
> > 1. copy the branch-3.2 source to a temp directory
> > (src/patched/branch-3.2)
> > 2. apply the ZOOKEEPER patches in my patches directory
> > 3. build zookeeper in the temp directory
> >
> > -Todd
> >> -----Original Message-----
> >> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >> Sent: Monday, August 03, 2009 4:09 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Flavio,
> >> I notice that you've updated the patches referenced for the WAN
> >> deployment. There appears to be an order dependency w/ respect to these
> >> four patches...
> >>
> >> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>
> >> 473 -> 479 (479 fails)
> >>
> >> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> >> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> >> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> >> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >> Hunk #1 FAILED at 93.
> >> Hunk #2 FAILED at 145.
> >> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> >>
> >> Could you advise as to which patches I need to apply, and in what order?
> >>
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>> Sent: Friday, July 31, 2009 9:51 PM
> >>> To: zookeeper-user@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Perfect! Thanks for the update, Todd.
> >>>
> >>> -Flavio
> >>>
> >>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>
> >>>> Thanks. You were right, I had a stale version of 479. Compilation
> >>>> succeeds and all tests pass on branch-3.2 with the latest patches 473,
> >>>> 479, 481, and 491.
> >>>>
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>> To: zookeeper-user@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> It should be in 479. Perhaps you have a stale version of the patch.
> >>>>>
> >>>>> -Flavio
> >>>>>
> >>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>
> >>>>>> Flavio,
> >>>>>>
> >>>>>> I'm getting a compilation error for patch 491:
> >>>>>>
> >>>>>> compile-main:
> >>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
> >>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>   [javac]                                                    ^
> >>>>>>   [javac] 1 error
> >>>>>>
> >>>>>> I see a reference to getWeight both in patch 491 and in
> >>>>>> FastLeaderElection.java:
> >>>>>>
> >>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>
> >>>>>> However, I don't see a reference to this method in patches 473, 479, or
> >>>>>> 481. I also don't see a reference to this method in the trunk...
> >>>>>>
> >>>>>> -Todd
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>> -Todd
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>
> >>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> >>>>>>>>> 481).
> >>>>>>>>>
> >>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
> >>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
> >>>>>>>>> then disconnects everyone. Then they re-elect it again, and it loops
> >>>>>>>>> over and over.
> >>>>>>>>>
> >>>>>>>>> -------------
> >>>>>>>>> Server config
> >>>>>>>>> -------------
> >>>>>>>>>
> >>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>
> >>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>> weight.1=1
> >>>>>>>>> weight.2=1
> >>>>>>>>> weight.3=1
> >>>>>>>>> weight.4=1
> >>>>>>>>> weight.5=1
> >>>>>>>>>
> >>>>>>>>> group.2:6:7:8:9
> >>>>>>>>> weight.6=0
> >>>>>>>>> weight.7=0
> >>>>>>>>> weight.8=0
> >>>>>>>>> weight.9=0
> >>>>>>>>>
> >>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
> >>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> >>>>>>>>> have voting rights, and the ability to become a leader. The machines in
> >>>>>>>>> the pods all have a weight of zero, and are not expected to become
> >>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>
> >>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>
> >>>>>>>>> -Todd
> >>>>>>
> >>>>
> >


Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
  Most of the patches that you mention should be in the branch 3.2 by tomorrow
or so. 481, 479 are already in. 480 and 491 should be in by tomorrow. Would that
suffice for you?

Thanks
mahadev 


On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Another problem...I've reverted to the latest versions of the patches
> that are not specific to branch-3.2, and I'm getting two compilation
> errors:
> 
> build-generated:
>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>
> compile-main:
>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
>     [javac]         public String[] getQuorumPeers();
>     [javac]                         ^
>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
>     [javac]         public String getServerState();
>     [javac]                       ^
>     [javac] 2 errors
> 
> My build process is pretty simple:
> 
> 1. copy the branch-3.2 source to a temp directory
> (src/patched/branch-3.2)
> 2. apply the ZOOKEEPER patches in my patches directory
> 3. build zookeeper in the temp directory
> 
> -Todd
>> -----Original Message-----
>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>> Sent: Monday, August 03, 2009 4:09 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>> 
>> Flavio,
>> I notice that you've updated the patches referenced for the WAN
>> deployment. There appears to be an order dependency w/ respect to these
>> four patches...
>>
>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>
>> 473 -> 479 (479 fails)
>>
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>> Hunk #1 FAILED at 93.
>> Hunk #2 FAILED at 145.
>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
>>
>> Could you advise as to which patches I need to apply, and in what order?
>> 
>> -Todd
>> 
>>> -----Original Message-----
>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>> Sent: Friday, July 31, 2009 9:51 PM
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>> 
>>> Perfect! Thanks for the update, Todd.
>>> 
>>> -Flavio
>>> 
>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>> 
>>>> Thanks. You were right, I had a stale version of 479. Compilation
>>>> succeeds and all tests pass on branch-3.2 with the latest patches 473,
>>>> 479, 481, and 491.
>>>> 
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> It should be in 479. Perhaps you have a stale version of the patch.
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>> 
>>>>>> Flavio,
>>>>>> 
>>>>>> I'm getting a compilation error for patch 491:
>>>>>>
>>>>>> compile-main:
>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>   [javac]                                                    ^
>>>>>>   [javac] 1 error
>>>>>>
>>>>>> I see a reference to getWeight both in patch 491 and in
>>>>>> FastLeaderElection.java:
>>>>>>
>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>
>>>>>> However, I don't see a reference to this method in patches 473, 479, or
>>>>>> 481. I also don't see a reference to this method in the trunk...
>>>>>> 
>>>>>> -Todd
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>> -Todd
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>> 
>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>> 
>>>>>>>> -Flavio
>>>>>>>> 
>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>> 
>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>>>>>>> 481).
>>>>>>>>>
>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
>>>>>>>>> then disconnects everyone. Then they re-elect it again, and it loops
>>>>>>>>> over and over.
>>>>>>>>> 
>>>>>>>>> -------------
>>>>>>>>> Server config
>>>>>>>>> -------------
>>>>>>>>> 
>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>> 
>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>> weight.1=1
>>>>>>>>> weight.2=1
>>>>>>>>> weight.3=1
>>>>>>>>> weight.4=1
>>>>>>>>> weight.5=1
>>>>>>>>> 
>>>>>>>>> group.2:6:7:8:9
>>>>>>>>> weight.6=0
>>>>>>>>> weight.7=0
>>>>>>>>> weight.8=0
>>>>>>>>> weight.9=0
>>>>>>>>> 
>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
>>>>>>>>> have voting rights, and the ability to become a leader. The machines in
>>>>>>>>> the pods all have a weight of zero, and are not expected to become
>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>> 
>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>> 
>>>>>>>>> -Todd
>>>>>> 
>>>> 
> 


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Another problem...I've reverted to the latest versions of the patches
that are not specific to branch-3.2, and I'm getting two compilation
errors:

build-generated:
    [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes

compile-main:
    [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
    [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
    [javac]         public String[] getQuorumPeers();
    [javac]                         ^
    [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
    [javac]         public String getServerState();
    [javac]                       ^
    [javac] 2 errors
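
A note on the two errors above: javac reports each method as clashing with itself, which is what one would expect if the same declaration ended up in QuorumStats.java twice, for example because a patch was applied on top of source that already contained the change. That is only a guess from the error text, not something confirmed in this thread. A small Python sketch for spotting repeated declarations follows; the sample source, the interface name Example, and the function name duplicate_declarations are invented for illustration.

```python
from collections import Counter


def duplicate_declarations(java_source):
    """Return 'public ...;' declaration lines that occur more than once.

    This is a plain text comparison, not a Java parser, so it only catches
    declarations that are repeated character for character.
    """
    decls = []
    for line in java_source.splitlines():
        stripped = line.strip()
        if stripped.startswith("public ") and stripped.endswith(");"):
            decls.append(stripped)
    counts = Counter(decls)
    return sorted(d for d, n in counts.items() if n > 1)


# Hypothetical file contents after the same hunk has been applied twice.
SAMPLE = """
public interface Example {
    public String[] getQuorumPeers();
    public String getServerState();
    public String[] getQuorumPeers();
    public String getServerState();
}
"""

if __name__ == "__main__":
    for decl in duplicate_declarations(SAMPLE):
        print("declared more than once:", decl)
```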

My build process is pretty simple:

1. copy the branch-3.2 source to a temp directory
(src/patched/branch-3.2)
2. apply the ZOOKEEPER patches in my patches directory
3. build zookeeper in the temp directory
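
The three steps above could be scripted roughly as follows. This is a sketch, not the script used here: the directory arguments, the patch order, and the helper names patch_command and build_patched are assumptions. It relies only on GNU patch's -p0, -i and --dry-run options and a plain ant invocation.

```python
import shutil
import subprocess
from pathlib import Path

# Patch file names as listed in this thread; applying them in this order is an assumption.
PATCHES = [
    "ZOOKEEPER-473.patch",
    "ZOOKEEPER-479-branch3.2.patch",
    "ZOOKEEPER-481-branch3.2.patch",
    "ZOOKEEPER-491.patch",
]


def patch_command(patch_path, dry_run=False):
    """Build the argument list for one `patch -p0` run."""
    cmd = ["patch", "-p0"]
    if dry_run:
        cmd.append("--dry-run")  # check that every hunk applies without touching files
    return cmd + ["-i", str(patch_path)]


def build_patched(src_dir, work_dir, patches_dir, patches=PATCHES):
    """Copy the source, apply the patches in order, then build.

    Stops at the first patch or build failure (check=True raises).
    """
    # Step 1: copy the branch source; copytree fails if work_dir already exists.
    shutil.copytree(src_dir, work_dir)
    # Step 2: apply each patch, doing a dry run first so a failing patch
    # is reported before it leaves partial changes behind.
    for name in patches:
        path = Path(patches_dir).resolve() / name
        subprocess.run(patch_command(path, dry_run=True), cwd=work_dir, check=True)
        subprocess.run(patch_command(path), cwd=work_dir, check=True)
    # Step 3: build in the patched copy.
    subprocess.run(["ant"], cwd=work_dir, check=True)
```

The dry run before each real application is a design choice: it turns a failed hunk into an early stop instead of a half-patched tree with .rej files.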

-Todd
> -----Original Message-----
> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> Sent: Monday, August 03, 2009 4:09 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: RE: Unending Leader Elections in WAN deploy
> 
> Flavio,
> I notice that you've updated the patches referenced for the WAN
> deployment. There appears to be an order dependency w/ respect to these
> four patches...
>
> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>
> 473 -> 479 (479 fails)
>
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> Hunk #1 FAILED at 93.
> Hunk #2 FAILED at 145.
> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
>
> Could you advise as to which patches I need to apply, and in what order?
> 
> -Todd
> 
> > -----Original Message-----
> > From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > Sent: Friday, July 31, 2009 9:51 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: Unending Leader Elections in WAN deploy
> >
> > Perfect! Thanks for the update, Todd.
> >
> > -Flavio
> >
> > On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >
> > > Thanks. You were right, I had a stale version of 479. Compilation
> > > succeeds and all tests pass on branch-3.2 with the latest patches 473,
> > > 479, 481, and 491.
> > >
> > > -Todd
> > >
> > >> -----Original Message-----
> > >> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >> Sent: Friday, July 31, 2009 7:48 PM
> > >> To: zookeeper-user@hadoop.apache.org
> > >> Subject: Re: Unending Leader Elections in WAN deploy
> > >>
> > >> It should be in 479. Perhaps you have a stale version of the patch.
> > >>
> > >> -Flavio
> > >>
> > >> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> > >>
> > >>> Flavio,
> > >>>
> > >>> I'm getting a compilation error for patch 491:
> > >>>
> > >>> compile-main:
> > >>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> > >>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
> > >>>   [javac] symbol  : method getWeight(long)
> > >>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> > >>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>>   [javac]                                                    ^
> > >>>   [javac] 1 error
> > >>>
> > >>> I see a reference to getWeight both in patch 491 and in
> > >>> FastLeaderElection.java:
> > >>>
> > >>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>>
> > >>> However, I don't see a reference to this method in patches 473, 479, or
> > >>> 481. I also don't see a reference to this method in the trunk...
> > >>>
> > >>> -Todd
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> > >>>> Sent: Friday, July 31, 2009 7:30 PM
> > >>>> To: zookeeper-user@hadoop.apache.org
> > >>>> Subject: RE: Unending Leader Elections in WAN deploy
> > >>>>
> > >>>> Ok, I'll apply that patch and report back.
> > >>>> -Todd
> > >>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >>>>> Sent: Friday, July 31, 2009 7:18 PM
> > >>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>
> > >>>>> You're missing 491 from your set of patches.
> > >>>>>
> > >>>>> -Flavio
> > >>>>>
> > >>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> > >>>>>
> > >>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> > >>>>>> 481).
> > >>>>>>
> > >>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
> > >>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
> > >>>>>> then disconnects everyone. Then they re-elect it again, and it loops
> > >>>>>> over and over.
> > >>>>>>
> > >>>>>> -------------
> > >>>>>> Server config
> > >>>>>> -------------
> > >>>>>>
> > >>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> > >>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> > >>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> > >>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> > >>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> > >>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> > >>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> > >>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> > >>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> > >>>>>>
> > >>>>>> group.1:1:2:3:4:5
> > >>>>>> weight.1=1
> > >>>>>> weight.2=1
> > >>>>>> weight.3=1
> > >>>>>> weight.4=1
> > >>>>>> weight.5=1
> > >>>>>>
> > >>>>>> group.2:6:7:8:9
> > >>>>>> weight.6=0
> > >>>>>> weight.7=0
> > >>>>>> weight.8=0
> > >>>>>> weight.9=0
> > >>>>>>
> > >>>>>> Note that we have 2 groups, composed of machines in 3
different
> > >>>>>> locations (dc1, pd1, and pd4). The idea is that only machines
> in
> > >>> dc1
> > >>>>>> have voting rights, and the ability to become a leader. The
> > >>> machines
> > >>>>>> in
> > >>>>>> the pods all have a weight of zero, and are not expected to
> > > become
> > >>>>>> leaders, or to vote on transactions.
> > >>>>>>
> > >>>>>> Let me know what I can do to help resolve this issue.
> > >>>>>>
> > >>>>>> -Todd
> > >>>
> > >


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Flavio,
I notice that you've updated the patches referenced for the WAN
deployment. There appears to be an order dependency w/ respect to these
four patches...

ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch

473 -> 479 (479 fails)

toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
Hunk #1 FAILED at 93.
Hunk #2 FAILED at 145.
2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/

Could you advise as to which patches I need to apply, and in what order?

-Todd

> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Friday, July 31, 2009 9:51 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Perfect! Thanks for the update, Todd.
> 
> -Flavio
> 
> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> 
> > Thanks. You were right, I had a stale version of 479. Compilation
> > succeeds and all tests pass on branch-3.2 with the latest patches 473,
> > 479, 481, and 491.
> >
> > -Todd
> >
> >> -----Original Message-----
> >> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >> Sent: Friday, July 31, 2009 7:48 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: Re: Unending Leader Elections in WAN deploy
> >>
> >> It should be in 479. Perhaps you have a stale version of the patch.
> >>
> >> -Flavio
> >>
> >> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>
> >>> Flavio,
> >>>
> >>> I'm getting a compilation error for patch 491:
> >>>
> >>> compile-main:
> >>>   [javac] Compiling 1 source file to
> >>>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>> src/p
> >>> atched/branch-3.2/build/classes
> >>>   [javac]
> >>>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>> src/p
> >>>
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> >>> FastL
> >>> eaderElection.java:601: cannot find symbol
> >>>   [javac] symbol  : method getWeight(long)
> >>>   [javac] location: interface
> >>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>   [javac]
> >>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>   [javac]                                                    ^
> >>>   [javac] 1 error
> >>>
> >>> I see a reference to getWeight in both FastLeaderElection.java in
> >>> patch
> >>> 491:
> >>>
> >>> patches/ZOOKEEPER-491.patch:+
> >>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>> src/java/main/org/apache/zookeeper/server/quorum/
> >>> FastLeaderElection.java
> >>> :
> >>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> >>> 0)
> >>>
> >>> However, I don't see a reference to this method in patches 473,
479,
> >>> or
> >>> 481. I also don't see a reference to this method in the trunk...
> >>>
> >>> -Todd
> >>>
> >>>> -----Original Message-----
> >>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>> To: zookeeper-user@hadoop.apache.org
> >>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>
> >>>> Ok, I'll apply that patch and report back.
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>> To: zookeeper-user@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> You're missing 491 from your set of patches.
> >>>>>
> >>>>> -Flavio
> >>>>>
> >>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>
> >>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473,
479,
> >>>>>> 481).
> >>>>>>
> >>>>>> Basically, it seems like the nodes are electing pd4-zook02 to
be
> >>> the
> >>>>>> leader. However, pd4-zook02 seems to realize it's not supposed
to
> >>> be
> >>>>>> and
> >>>>>> then disconnects everyone. Then they re-elect it again, and it
> >>> loops
> >>>>>> over and over.
> >>>>>>
> >>>>>> -------------
> >>>>>> Server config
> >>>>>> -------------
> >>>>>>
> >>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>
> >>>>>> group.1:1:2:3:4:5
> >>>>>> weight.1=1
> >>>>>> weight.2=1
> >>>>>> weight.3=1
> >>>>>> weight.4=1
> >>>>>> weight.5=1
> >>>>>>
> >>>>>> group.2:6:7:8:9
> >>>>>> weight.6=0
> >>>>>> weight.7=0
> >>>>>> weight.8=0
> >>>>>> weight.9=0
> >>>>>>
> >>>>>> Note that we have 2 groups, composed of machines in 3 different
> >>>>>> locations (dc1, pd1, and pd4). The idea is that only machines
in
> >>> dc1
> >>>>>> have voting rights, and the ability to become a leader. The
> >>> machines
> >>>>>> in
> >>>>>> the pods all have a weight of zero, and are not expected to
> > become
> >>>>>> leaders, or to vote on transactions.
> >>>>>>
> >>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>
> >>>>>> -Todd
> >>>
> >


Re: Unending Leader Elections in WAN deploy

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Perfect! Thanks for the update, Todd.

-Flavio

On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:

> Thanks. You were right, I had a stale version of 479. Compilation
> succeeds and all tests pass on branch-3.2 with the latest patches 473,
> 479, 481, and 491.
>
> -Todd
>
>> -----Original Message-----
>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>> Sent: Friday, July 31, 2009 7:48 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: Re: Unending Leader Elections in WAN deploy
>>
>> It should be in 479. Perhaps you have a stale version of the patch.
>>
>> -Flavio
>>
>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>
>>> Flavio,
>>>
>>> I'm getting a compilation error for patch 491:
>>>
>>> compile-main:
>>>   [javac] Compiling 1 source file to
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>> src/p
>>> atched/branch-3.2/build/classes
>>>   [javac]
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>> src/p
>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>> FastL
>>> eaderElection.java:601: cannot find symbol
>>>   [javac] symbol  : method getWeight(long)
>>>   [javac] location: interface
>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>   [javac]
>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>   [javac]                                                    ^
>>>   [javac] 1 error
>>>
>>> I see a reference to getWeight in both FastLeaderElection.java in
>>> patch
>>> 491:
>>>
>>> patches/ZOOKEEPER-491.patch:+
>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>> src/java/main/org/apache/zookeeper/server/quorum/
>>> FastLeaderElection.java
>>> :
>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>> 0)
>>>
>>> However, I don't see a reference to this method in patches 473, 479,
>>> or
>>> 481. I also don't see a reference to this method in the trunk...
>>>
>>> -Todd
>>>
>>>> -----Original Message-----
>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>> To: zookeeper-user@hadoop.apache.org
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>
>>>> Ok, I'll apply that patch and report back.
>>>> -Todd
>>>>
>>>>> -----Original Message-----
>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>
>>>>> You're missing 491 from your set of patches.
>>>>>
>>>>> -Flavio
>>>>>
>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>
>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>>>> 481).
>>>>>>
>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be
>>> the
>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to
>>> be
>>>>>> and
>>>>>> then disconnects everyone. Then they re-elect it again, and it
>>> loops
>>>>>> over and over.
>>>>>>
>>>>>> -------------
>>>>>> Server config
>>>>>> -------------
>>>>>>
>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>
>>>>>> group.1:1:2:3:4:5
>>>>>> weight.1=1
>>>>>> weight.2=1
>>>>>> weight.3=1
>>>>>> weight.4=1
>>>>>> weight.5=1
>>>>>>
>>>>>> group.2:6:7:8:9
>>>>>> weight.6=0
>>>>>> weight.7=0
>>>>>> weight.8=0
>>>>>> weight.9=0
>>>>>>
>>>>>> Note that we have 2 groups, composed of machines in 3 different
>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in
>>> dc1
>>>>>> have voting rights, and the ability to become a leader. The
>>> machines
>>>>>> in
>>>>>> the pods all have a weight of zero, and are not expected to
> become
>>>>>> leaders, or to vote on transactions.
>>>>>>
>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>
>>>>>> -Todd
>>>
>


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Thanks. You were right, I had a stale version of 479. Compilation
succeeds and all tests pass on branch-3.2 with the latest patches 473,
479, 481, and 491.

-Todd
 
> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Friday, July 31, 2009 7:48 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> It should be in 479. Perhaps you have a stale version of the patch.
> 
> -Flavio
> 
> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> 
> > Flavio,
> >
> > I'm getting a compilation error for patch 491:
> >
> > compile-main:
> >    [javac] Compiling 1 source file to
> > /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > src/p
> > atched/branch-3.2/build/classes
> >    [javac]
> > /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > src/p
> > atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> > FastL
> > eaderElection.java:601: cannot find symbol
> >    [javac] symbol  : method getWeight(long)
> >    [javac] location: interface
> > org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >    [javac]
> > if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >    [javac]                                                    ^
> >    [javac] 1 error
> >
> > I see a reference to getWeight in both FastLeaderElection.java in
> > patch
> > 491:
> >
> > patches/ZOOKEEPER-491.patch:+
> > if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > src/java/main/org/apache/zookeeper/server/quorum/
> > FastLeaderElection.java
> > :
> > if(self.getQuorumVerifier().getWeight(n.sid) !=
> > 0)
> >
> > However, I don't see a reference to this method in patches 473, 479,
> > or
> > 481. I also don't see a reference to this method in the trunk...
> >
> > -Todd
> >
> >> -----Original Message-----
> >> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >> Sent: Friday, July 31, 2009 7:30 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Ok, I'll apply that patch and report back.
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>> Sent: Friday, July 31, 2009 7:18 PM
> >>> To: zookeeper-user@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> You're missing 491 from your set of patches.
> >>>
> >>> -Flavio
> >>>
> >>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>
> >>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> >>>> 481).
> >>>>
> >>>> Basically, it seems like the nodes are electing pd4-zook02 to be
> > the
> >>>> leader. However, pd4-zook02 seems to realize it's not supposed to
> > be
> >>>> and
> >>>> then disconnects everyone. Then they re-elect it again, and it
> > loops
> >>>> over and over.
> >>>>
> >>>> -------------
> >>>> Server config
> >>>> -------------
> >>>>
> >>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>
> >>>> group.1:1:2:3:4:5
> >>>> weight.1=1
> >>>> weight.2=1
> >>>> weight.3=1
> >>>> weight.4=1
> >>>> weight.5=1
> >>>>
> >>>> group.2:6:7:8:9
> >>>> weight.6=0
> >>>> weight.7=0
> >>>> weight.8=0
> >>>> weight.9=0
> >>>>
> >>>> Note that we have 2 groups, composed of machines in 3 different
> >>>> locations (dc1, pd1, and pd4). The idea is that only machines in
> > dc1
> >>>> have voting rights, and the ability to become a leader. The
> > machines
> >>>> in
> >>>> the pods all have a weight of zero, and are not expected to
become
> >>>> leaders, or to vote on transactions.
> >>>>
> >>>> Let me know what I can do to help resolve this issue.
> >>>>
> >>>> -Todd
> >


Re: Unending Leader Elections in WAN deploy

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
It should be in 479. Perhaps you have a stale version of the patch.

-Flavio

On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:

> Flavio,
>
> I'm getting a compilation error for patch 491:
>
> compile-main:
>    [javac] Compiling 1 source file to
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ 
> src/p
> atched/branch-3.2/build/classes
>    [javac]
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ 
> src/p
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/ 
> FastL
> eaderElection.java:601: cannot find symbol
>    [javac] symbol  : method getWeight(long)
>    [javac] location: interface
> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>    [javac]
> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>    [javac]                                                    ^
>    [javac] 1 error
>
> I see a reference to getWeight in both FastLeaderElection.java in  
> patch
> 491:
>
> patches/ZOOKEEPER-491.patch:+
> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> src/java/main/org/apache/zookeeper/server/quorum/ 
> FastLeaderElection.java
> :                         
> if(self.getQuorumVerifier().getWeight(n.sid) !=
> 0)
>
> However, I don't see a reference to this method in patches 473, 479,  
> or
> 481. I also don't see a reference to this method in the trunk...
>
> -Todd
>
>> -----Original Message-----
>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>> Sent: Friday, July 31, 2009 7:30 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>>
>> Ok, I'll apply that patch and report back.
>> -Todd
>>
>>> -----Original Message-----
>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>> Sent: Friday, July 31, 2009 7:18 PM
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>
>>> You're missing 491 from your set of patches.
>>>
>>> -Flavio
>>>
>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>
>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>> 481).
>>>>
>>>> Basically, it seems like the nodes are electing pd4-zook02 to be
> the
>>>> leader. However, pd4-zook02 seems to realize it's not supposed to
> be
>>>> and
>>>> then disconnects everyone. Then they re-elect it again, and it
> loops
>>>> over and over.
>>>>
>>>> -------------
>>>> Server config
>>>> -------------
>>>>
>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>
>>>> group.1:1:2:3:4:5
>>>> weight.1=1
>>>> weight.2=1
>>>> weight.3=1
>>>> weight.4=1
>>>> weight.5=1
>>>>
>>>> group.2:6:7:8:9
>>>> weight.6=0
>>>> weight.7=0
>>>> weight.8=0
>>>> weight.9=0
>>>>
>>>> Note that we have 2 groups, composed of machines in 3 different
>>>> locations (dc1, pd1, and pd4). The idea is that only machines in
> dc1
>>>> have voting rights, and the ability to become a leader. The
> machines
>>>> in
>>>> the pods all have a weight of zero, and are not expected to become
>>>> leaders, or to vote on transactions.
>>>>
>>>> Let me know what I can do to help resolve this issue.
>>>>
>>>> -Todd
>


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Flavio,

I'm getting a compilation error for patch 491:

compile-main:
    [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
    [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
    [javac] symbol  : method getWeight(long)
    [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
    [javac]                         if(self.getQuorumVerifier().getWeight(n.sid) != 0)
    [javac]                                                    ^
    [javac] 1 error

I see references to getWeight in both patch 491 and
FastLeaderElection.java:

patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:                        if(self.getQuorumVerifier().getWeight(n.sid) != 0)

However, I don't see a reference to this method in patches 473, 479, or
481. I also don't see a reference to this method in the trunk...

-Todd
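
For readers following the compiler error above: the message says that the
QuorumVerifier interface in this source tree has no getWeight(long) method,
while the patched FastLeaderElection.java calls it. The sketch below is only
an illustration of that relationship, not ZooKeeper code. The nested
QuorumVerifier interface, MapVerifier, and shouldCountVote are hypothetical
names invented for the example, and the default weight of 1 is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class WeightGuardSketch {
    /** Hypothetical stand-in for the interface named in the compiler error. */
    public interface QuorumVerifier {
        long getWeight(long serverId);
    }

    /** Illustrative verifier backed by a map of server id to weight. */
    public static final class MapVerifier implements QuorumVerifier {
        private final Map<Long, Long> weights = new HashMap<>();

        public MapVerifier weight(long sid, long w) {
            weights.put(sid, w);
            return this;
        }

        @Override
        public long getWeight(long sid) {
            // Assumption for this sketch: unlisted servers default to weight 1.
            return weights.getOrDefault(sid, 1L);
        }
    }

    /** Same shape as the guard in the error output: skip zero-weight senders. */
    public static boolean shouldCountVote(QuorumVerifier verifier, long senderSid) {
        return verifier.getWeight(senderSid) != 0;
    }

    public static void main(String[] args) {
        MapVerifier verifier = new MapVerifier().weight(1, 1).weight(9, 0);
        System.out.println("server 1 counted: " + shouldCountVote(verifier, 1));
        System.out.println("server 9 counted: " + shouldCountVote(verifier, 9));
    }
}
```

If the interface lacks the method, as with a stale patch, the call site fails
to compile exactly as shown in the javac output.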

> -----Original Message-----
> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> Sent: Friday, July 31, 2009 7:30 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: RE: Unending Leader Elections in WAN deploy
> 
> Ok, I'll apply that patch and report back.
> -Todd
> 
> > -----Original Message-----
> > From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > Sent: Friday, July 31, 2009 7:18 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: Unending Leader Elections in WAN deploy
> >
> > You're missing 491 from your set of patches.
> >
> > -Flavio
> >
> > On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >
> > > This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> > > 481).
> > >
> > > Basically, it seems like the nodes are electing pd4-zook02 to be
the
> > > leader. However, pd4-zook02 seems to realize it's not supposed to
be
> > > and
> > > then disconnects everyone. Then they re-elect it again, and it
loops
> > > over and over.
> > >
> > > -------------
> > > Server config
> > > -------------
> > >
> > > server.1=dc1-zook01.dc01.revsci.net:2888:3888
> > > server.2=dc1-zook02.dc01.revsci.net:2888:3888
> > > server.3=dc1-zook03.dc01.revsci.net:2888:3888
> > > server.4=dc1-zook04.dc01.revsci.net:2888:3888
> > > server.5=dc1-zook05.dc01.revsci.net:2888:3888
> > > server.6=pd1-zook01.pd01.revsci.net:2888:3888
> > > server.7=pd1-zook02.pd01.revsci.net:2888:3888
> > > server.8=pd4-zook01.iad1.audsci.net:2888:3888
> > > server.9=pd4-zook02.iad1.audsci.net:2888:3888
> > >
> > > group.1:1:2:3:4:5
> > > weight.1=1
> > > weight.2=1
> > > weight.3=1
> > > weight.4=1
> > > weight.5=1
> > >
> > > group.2:6:7:8:9
> > > weight.6=0
> > > weight.7=0
> > > weight.8=0
> > > weight.9=0
> > >
> > > Note that we have 2 groups, composed of machines in 3 different
> > > locations (dc1, pd1, and pd4). The idea is that only machines in
dc1
> > > have voting rights, and the ability to become a leader. The
machines
> > > in
> > > the pods all have a weight of zero, and are not expected to become
> > > leaders, or to vote on transactions.
> > >
> > > Let me know what I can do to help resolve this issue.
> > >
> > > -Todd


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Ok, I'll apply that patch and report back.
-Todd

> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Friday, July 31, 2009 7:18 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> You're missing 491 from your set of patches.
> 
> -Flavio
> 
> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> 
> > This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> > 481).
> >
> > Basically, it seems like the nodes are electing pd4-zook02 to be the
> > leader. However, pd4-zook02 seems to realize it's not supposed to be
> > and
> > then disconnects everyone. Then they re-elect it again, and it loops
> > over and over.
> >
> > -------------
> > Server config
> > -------------
> >
> > server.1=dc1-zook01.dc01.revsci.net:2888:3888
> > server.2=dc1-zook02.dc01.revsci.net:2888:3888
> > server.3=dc1-zook03.dc01.revsci.net:2888:3888
> > server.4=dc1-zook04.dc01.revsci.net:2888:3888
> > server.5=dc1-zook05.dc01.revsci.net:2888:3888
> > server.6=pd1-zook01.pd01.revsci.net:2888:3888
> > server.7=pd1-zook02.pd01.revsci.net:2888:3888
> > server.8=pd4-zook01.iad1.audsci.net:2888:3888
> > server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >
> > group.1:1:2:3:4:5
> > weight.1=1
> > weight.2=1
> > weight.3=1
> > weight.4=1
> > weight.5=1
> >
> > group.2:6:7:8:9
> > weight.6=0
> > weight.7=0
> > weight.8=0
> > weight.9=0
> >
> > Note that we have 2 groups, composed of machines in 3 different
> > locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> > have voting rights, and the ability to become a leader. The machines
> > in
> > the pods all have a weight of zero, and are not expected to become
> > leaders, or to vote on transactions.
> >
> > Let me know what I can do to help resolve this issue.
> >
> > -Todd


Re: Unending Leader Elections in WAN deploy

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
You're missing 491 from your set of patches.

-Flavio

On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:

> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,  
> 481).
>
> Basically, it seems like the nodes are electing pd4-zook02 to be the
> leader. However, pd4-zook02 seems to realize it's not supposed to be  
> and
> then disconnects everyone. Then they re-elect it again, and it loops
> over and over.
>
> -------------
> Server config
> -------------
>
> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>
> group.1:1:2:3:4:5
> weight.1=1
> weight.2=1
> weight.3=1
> weight.4=1
> weight.5=1
>
> group.2:6:7:8:9
> weight.6=0
> weight.7=0
> weight.8=0
> weight.9=0
>
> Note that we have 2 groups, composed of machines in 3 different
> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> have voting rights, and the ability to become a leader. The machines  
> in
> the pods all have a weight of zero, and are not expected to become
> leaders, or to vote on transactions.
>
> Let me know what I can do to help resolve this issue.
>
> -Todd


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Somehow the logs did not attach. Zookeeper logs should be attached.

> -----Original Message-----
> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> Sent: Friday, July 31, 2009 7:15 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Unending Leader Elections in WAN deploy
> 
> This repro's in both branch-3.2, and branch-3.2+patches(473, 479, 481).
> 
> Basically, it seems like the nodes are electing pd4-zook02 to be the
> leader. However, pd4-zook02 seems to realize it's not supposed to be and
> then disconnects everyone. Then they re-elect it again, and it loops
> over and over.
> 
> -------------
> Server config
> -------------
> 
> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> 
> group.1:1:2:3:4:5
> weight.1=1
> weight.2=1
> weight.3=1
> weight.4=1
> weight.5=1
> 
> group.2:6:7:8:9
> weight.6=0
> weight.7=0
> weight.8=0
> weight.9=0
> 
> Note that we have 2 groups, composed of machines in 3 different
> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> have voting rights, and the ability to become a leader. The machines in
> the pods all have a weight of zero, and are not expected to become
> leaders, or to vote on transactions.
> 
> Let me know what I can do to help resolve this issue.
> 
> -Todd

Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
This repros in both branch-3.2 and branch-3.2+patches (473, 479, 481).

Basically, it seems like the nodes are electing pd4-zook02 to be the
leader. However, pd4-zook02 seems to realize it's not supposed to be and
then disconnects everyone. Then they re-elect it again, and it loops
over and over.

-------------
Server config
-------------

server.1=dc1-zook01.dc01.revsci.net:2888:3888
server.2=dc1-zook02.dc01.revsci.net:2888:3888
server.3=dc1-zook03.dc01.revsci.net:2888:3888
server.4=dc1-zook04.dc01.revsci.net:2888:3888
server.5=dc1-zook05.dc01.revsci.net:2888:3888
server.6=pd1-zook01.pd01.revsci.net:2888:3888
server.7=pd1-zook02.pd01.revsci.net:2888:3888
server.8=pd4-zook01.iad1.audsci.net:2888:3888
server.9=pd4-zook02.iad1.audsci.net:2888:3888

group.1:1:2:3:4:5               
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1

group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0

Note that we have 2 groups, composed of machines in 3 different
locations (dc1, pd1, and pd4). The idea is that only machines in dc1
have voting rights, and the ability to become a leader. The machines in
the pods all have a weight of zero, and are not expected to become
leaders, or to vote on transactions.

Let me know what I can do to help resolve this issue.

-Todd
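
The intent of the config above (only dc1 servers vote, zero-weight pod
servers never decide anything) can be sketched as follows. This is a rough
illustration and not ZooKeeper's QuorumHierarchical implementation. It
assumes the rule "a quorum needs more than half the weight in more than half
of the groups that have nonzero total weight"; the class and method names
are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HierarchicalQuorumSketch {
    private final Map<Long, Long> groupOf = new HashMap<>();   // server id -> group id
    private final Map<Long, Long> weightOf = new HashMap<>();  // server id -> weight

    public HierarchicalQuorumSketch addServer(long sid, long gid, long weight) {
        groupOf.put(sid, gid);
        weightOf.put(sid, weight);
        return this;
    }

    /** Assumed rule: majority of weight in a majority of nonzero-weight groups. */
    public boolean containsQuorum(Set<Long> acks) {
        Map<Long, Long> total = new HashMap<>();
        Map<Long, Long> acked = new HashMap<>();
        for (Map.Entry<Long, Long> e : groupOf.entrySet()) {
            long sid = e.getKey();
            long gid = e.getValue();
            long w = weightOf.get(sid);
            total.merge(gid, w, Long::sum);
            if (acks.contains(sid)) {
                acked.merge(gid, w, Long::sum);
            }
        }
        int votingGroups = 0;
        int satisfied = 0;
        for (Map.Entry<Long, Long> e : total.entrySet()) {
            if (e.getValue() == 0) {
                continue; // a group whose weights are all zero has no say
            }
            votingGroups++;
            if (acked.getOrDefault(e.getKey(), 0L) > e.getValue() / 2) {
                satisfied++;
            }
        }
        return satisfied > votingGroups / 2;
    }

    /** Builds the layout from the thread: servers 1-5 weigh 1, servers 6-9 weigh 0. */
    public static HierarchicalQuorumSketch threadConfig() {
        HierarchicalQuorumSketch q = new HierarchicalQuorumSketch();
        for (long sid = 1; sid <= 5; sid++) q.addServer(sid, 1, 1);
        for (long sid = 6; sid <= 9; sid++) q.addServer(sid, 2, 0);
        return q;
    }

    public static void main(String[] args) {
        HierarchicalQuorumSketch q = threadConfig();
        System.out.println(q.containsQuorum(new HashSet<>(Arrays.asList(1L, 2L, 3L))));
        System.out.println(q.containsQuorum(new HashSet<>(Arrays.asList(6L, 7L, 8L, 9L))));
    }
}
```

Under that assumed rule, acknowledgements from pod servers alone can never
form a quorum, which matches the expectation stated in the message.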

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Todd Greenwood wrote:
> On a plus note, I'm finding that this morning, @work rather than @home,
> the tests continue to completion. However, there are other issues that
> I'll bring up on the dev list, such as a requirement to have autoconf
> installed, and problems in the create-cppunit-configure task that can't
> exec libtoolize, fun stuff like that.

Great, good to hear. At some point figuring out what's up with your 
@home would be interesting to us. :-)

Yes, there are some basic requirements such as autotools, cppunit, etc...
but please do raise all this on the dev list.

> I need to proceed with the manual patches to branch-3.2, as I am under
> some time constraints to get our infrastructure deployed such that QA
> can start playing with it. However, I'll switch to 3.2.1 as soon as I
> can.

Understood.

Patrick
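
The try/catch instrumentation suggested in the quoted message below could
look roughly like this. It is a minimal sketch in plain Java rather than the
actual QuorumPeerMainTest code: TestBody and runInstrumented are hypothetical
names, and in the real test the body of the failing method would go where
the lambda is.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class InstrumentedTestSketch {
    /** Hypothetical placeholder for the body of the failing test method. */
    public interface TestBody {
        void run() throws Exception;
    }

    /** Runs the body, prints any failure to stdout with its stack trace, then rethrows. */
    public static void runInstrumented(String name, TestBody body) throws Exception {
        try {
            body.run();
        } catch (Throwable t) {
            StringWriter trace = new StringWriter();
            t.printStackTrace(new PrintWriter(trace, true));
            // Print to stdout so the cause is visible even if log4j output is redirected.
            System.out.println("[" + name + "] failed: " + trace);
            throw t; // keep the test failing after the error has been printed
        }
    }

    public static void main(String[] args) {
        try {
            runInstrumented("demoTest", () -> {
                throw new IllegalStateException("simulated failure");
            });
        } catch (Exception e) {
            System.out.println("test still fails after logging: " + e.getMessage());
        }
    }
}
```

Rethrowing after printing keeps the test result unchanged while exposing the
underlying error.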

>> -----Original Message-----
>> From: Patrick Hunt [mailto:phunt@apache.org]
>> Sent: Friday, July 31, 2009 11:38 AM
>> To: zookeeper-user@hadoop.apache.org; Todd Greenwood
>> Subject: Re: test failures in branch-3.2
>>
>> Hi Todd,
>>
>> Sorry for the clutter/confusion. Usually things aren't this cumbersome ;-)
>> In particular:
>>    1 committer is on vacation
>>    Mahadev's been out sick for multiple days
>>    I'm sick but trying to hang in there, but def not 100%
>>
>> Hudson (CI) has been offline for effectively the past 3 weeks (that
>> gates all our commits) and is just now back but flaky.
>>
>> 3.2 had some bugs that we are trying to address, but the
>> aforementioned issues are slowing us down. Otherwise we'd have all
>> this straightened out by now ....
>>
>> At this point you should move this discussion to the dev list - Apache
>> doesn't really like us to discuss code changes/futures here (user
>> list). On that list you'll also see the plan for upcoming releases - I
>> mention all this because we are actively working toward 3.2.1, which
>> will include the JIRAs slated for that release (I'm sure you've seen).
>>
>> If you can wait a bit you might be able to avoid some pain by using
>> the upcoming 3.2.1 release. Once the patches land in that branch your
>> issues will be resolved w/o you needing to manually apply patches,
>> etc...
>>
>> I did look at the files you attached - they look fine, so I'm not sure
>> what the issue is. The form of this test makes it harder - we are
>> verifying that the log contains sufficient information when a
>> particular error occurs. We fiddle with log4j in order to do this,
>> which means that the log you are including doesn't specify the problem.
>>
>> Try instrumenting this test with a try/catch around the content of the
>> test method (all the code in the failing method inside a big try/catch
>> is what I mean). Then print the error to std out as part of the catch.
>> That should shed some light. If you could debug it a bit that would
>> help - because we aren't seeing this in our environment.
>>
>> Again, sort of a moot point if you can wait a week or so...
>>
>> Regards,
>>
>> Patrick
>>
>> Todd Greenwood wrote:
>>> Inline.
>>>
>>>> -----Original Message-----
>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>> Sent: Thursday, July 30, 2009 10:57 PM
>>>> To: zookeeper-user@hadoop.apache.org
>>>> Subject: Re: test failures in branch-3.2
>>>>
>>>> Todd Greenwood wrote:
>>>>> Starting w/ branch-3.2 (no changes) I applied patches in this
> order:
>>>>> 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest
>>> fails.
>>>>> 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file
> -
>>>>> PortAssignment.java.
>>>>>
>>>>> PortAssignment.java was added by Patrick as part of
>>> ZOOKEEPER-473.patch,
>>>>> which is a pretty hefty patch (> 2k lines) and touches a large
>>> number of
>>>>> files.
>>>> Hrm, those patches were probably created against the trunk. We'll
> have
>>>> to have separate patches for trunk and 3.2 branch on 481.
>>>>
>>>> If you could update the jira with this detail (481 needs two
> patches,
>>>> one for each branch) that would be great!
>>>>
>>> Done.
>>>
>>>>> 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails
>>> (jvm
>>>>> crashes).
>>>> 473 is "special" (unique) in the sense that it changes log4j while
> the
>>>> the vm is running. In general though it's a pretty boring test and
>>>> shouldn't be failing.
>>>>
>>>> Are you sure you have the right patch file? there are 2 patch files
> on
>>>> the JIRA for 473, make sure that you have the one from 7/16, NOT
> the
>>> one
>>>> from 7/15. Check that the patch file, the correct one should NOT
>>> contain
>>>> changes to build.xml or conf/log4j* files. If this still happens
> send
>>> me
>>>> your build.xml, conf/log4j* and QuroumPeerMainTest.java files in
> email
>>>> for review. I'll take a look.
>>>>
>>>
>>> I've annotated the files w/ their date while downloading:
>>> 112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
>>> 110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch
>>>
>>> It appears I applied the 7-16 patch, as that is the matching file
> size
>>> of the patch file I applied.
>>>
>>> If there are to be multiple patch files for multiple branches (3.2,
>>> trunk, etc.) would it make sense to lable the patch files
> accordingly?
>>> Requested files in attached tar.
>>>
>>> -Todd
>>>
>>>> Patrick
>>>>
>>>>
>>>>> [junit] Running
>>> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>>>     [junit] Running
>>>>> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>>>     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0
>>> sec
>>>>>     [junit] Test
>>> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>>> FAILED (crashed)
>>>>>
>>>>> ------------
>>>>> Test Log
>>>>> ------------
>>>>> Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>>> Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
>>>>>
>>>>> Testcase: testBadPeerAddressInQuorum took 0.004 sec
>>>>>     Caused an ERROR
>>>>> Forked Java VM exited abnormally. Please note the time in the
> report
>>>>> does not reflect the time until the VM exit.
>>>>> junit.framework.AssertionFailedError: Forked Java VM exited
>>> abnormally.
>>>>> Please note the time in the report does not reflect the time until
>>> the
>>>>> VM exit.
>>>>>
>>>>> -Todd
>>>>>
>>>>> -----Original Message-----
>>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>>> Sent: Thursday, July 30, 2009 10:13 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: test failures in branch-3.2
>>>>>
>>>>> Todd Greenwood wrote:
>>>>>> ....
>>>>>> [Todd] Yes, I believe "address in use" was the problem w/
> FLETest.
>>> I
>>>>>> assumed it was a timing issue w/ respect to test A not fully
>>> releasing
>>>>>> resources before test B started.
>>>>> Might be, but actually I think it's related to this:
>>>>> http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
>>>>>
>>>>> Patrick

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick,
Thank you for the background (and I hope you and Mahadev recover
quickly).

On a plus note, I'm finding that this morning, @work rather than @home,
the tests continue to completion. However, there are other issues that
I'll bring up on the dev list, such as a requirement to have autoconf
installed, and problems in the create-cppunit-configure task that can't
exec libtoolize, fun stuff like that.

I need to proceed with the manual patches to branch-3.2, as I am under
some time constraints to get our infrastructure deployed such that QA
can start playing with it. However, I'll switch to 3.2.1 as soon as I
can.

-Todd

> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Friday, July 31, 2009 11:38 AM
> To: zookeeper-user@hadoop.apache.org; Todd Greenwood
> Subject: Re: test failures in branch-3.2
> 
> Hi Todd,
> 
> Sorry for the clutter/confusion. Usually things aren't this cumbersome
;-)
> 
> In particular:
>    1 committer is on vacation
>    Mahadev's been out sick for multiple days
>    I'm sick but trying to hang in there, but def not 100%
> 
> Hudson (CI) has been offline for effectively the past 3 weeks (that
> gates all our commits) and is just now back but flaky.
> 
> 3.2 had some bugs that we are trying to address, but the afore
mentioned
> issues are slowing us down. Otw we'd have all this straightened out by
> now ....
> 
> At this point you should move this discussion to the dev list - Apache
> doesn't really like us to discuss code changes/futures here (user
list).
> On that list you'll also see the plan for upcoming releases - I
mention
> all this because we are actively working toward 3.2.1 which will
include
> the JIRAs slated for that release (I'm sure you've seen).
> 
> If you can wait a bit you might be able to avoid some pain by using
the
> upcoming 3.2.1 release. Once the patches land into that branch your
> issues will be resolved w/o you needing to manually apply patches,
etc...
> 
> 
> I did look at the files you attached - it looks fine so I'm not sure
the
> issue. The form of this test makes it harder - we are verifying that
the
> log contains sufficient information when a particular error occurs. We
> fiddle with log4j in order to do this, which means that the log you
are
> including doesn't specify the problem.
> 
> Try instrumenting this test with a try/catch around the content of the
> test method (all the code in the failing method inside a big try/catch
> is what I mean). Then print the error to std out as part of the catch.
> That should shed some light. If you could debug it a bit that would
help
> - because we aren't seeing this in our environment.
> 
> Again, sort of a moot point if you can wait a week or so...
> 
> Regards,
> 
> Patrick
> 
> Todd Greenwood wrote:
> > Inline.
> >
> >> -----Original Message-----
> >> From: Patrick Hunt [mailto:phunt@apache.org]
> >> Sent: Thursday, July 30, 2009 10:57 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: Re: test failures in branch-3.2
> >>
> >> Todd Greenwood wrote:
> >>> Starting w/ branch-3.2 (no changes) I applied patches in this
order:
> >>>
> >>> 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest
> > fails.
> >>> 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file
-
> >>> PortAssignment.java.
> >>>
> >>> PortAssignment.java was added by Patrick as part of
> > ZOOKEEPER-473.patch,
> >>> which is a pretty hefty patch (> 2k lines) and touches a large
> > number of
> >>> files.
> >> Hrm, those patches were probably created against the trunk. We'll
have
> >> to have separate patches for trunk and 3.2 branch on 481.
> >>
> >> If you could update the jira with this detail (481 needs two
patches,
> >> one for each branch) that would be great!
> >>
> >
> > Done.
> >
> >>> 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails
> > (jvm
> >>> crashes).
> >> 473 is "special" (unique) in the sense that it changes log4j while
the
> >> the vm is running. In general though it's a pretty boring test and
> >> shouldn't be failing.
> >>
> >> Are you sure you have the right patch file? there are 2 patch files
on
> >> the JIRA for 473, make sure that you have the one from 7/16, NOT
the
> > one
> >> from 7/15. Check that the patch file, the correct one should NOT
> > contain
> >> changes to build.xml or conf/log4j* files. If this still happens
send
> > me
> >> your build.xml, conf/log4j* and QuroumPeerMainTest.java files in
email
> >> for review. I'll take a look.
> >>
> >
> >
> > I've annotated the files w/ their date while downloading:
> > 112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
> > 110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch
> >
> > It appears I applied the 7-16 patch, as that is the matching file
size
> > of the patch file I applied.
> >
> > If there are to be multiple patch files for multiple branches (3.2,
> > trunk, etc.) would it make sense to lable the patch files
accordingly?
> >
> > Requested files in attached tar.
> >
> > -Todd
> >
> >> Patrick
> >>
> >>
> >>> [junit] Running
> > org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >>>     [junit] Running
> >>> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >>>     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0
> > sec
> >>>     [junit] Test
> > org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >>> FAILED (crashed)
> >>>
> >>> ------------
> >>> Test Log
> >>> ------------
> >>> Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >>> Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
> >>>
> >>> Testcase: testBadPeerAddressInQuorum took 0.004 sec
> >>>     Caused an ERROR
> >>> Forked Java VM exited abnormally. Please note the time in the
report
> >>> does not reflect the time until the VM exit.
> >>> junit.framework.AssertionFailedError: Forked Java VM exited
> > abnormally.
> >>> Please note the time in the report does not reflect the time until
> > the
> >>> VM exit.
> >>>
> >>> -Todd
> >>>
> >>> -----Original Message-----
> >>> From: Patrick Hunt [mailto:phunt@apache.org]
> >>> Sent: Thursday, July 30, 2009 10:13 PM
> >>> To: zookeeper-user@hadoop.apache.org
> >>> Subject: Re: test failures in branch-3.2
> >>>
> >>> Todd Greenwood wrote:
> >>>> ....
> >>>> [Todd] Yes, I believe "address in use" was the problem w/
FLETest.
> > I
> >>>> assumed it was a timing issue w/ respect to test A not fully
> > releasing
> >>>> resources before test B started.
> >>> Might be, but actually I think it's related to this:
> >>> http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
> >>>
> >>> Patrick

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Hi Todd,

Sorry for the clutter/confusion. Usually things aren't this cumbersome ;-)

In particular:
   1 committer is on vacation
   Mahadev's been out sick for multiple days
   I'm sick but trying to hang in there, but def not 100%

Hudson (CI) has been offline for effectively the past 3 weeks (that 
gates all our commits) and is just now back but flaky.

3.2 had some bugs that we are trying to address, but the aforementioned
issues are slowing us down. Otherwise we'd have all this straightened out
by now ....

At this point you should move this discussion to the dev list - Apache 
doesn't really like us to discuss code changes/futures here (user list). 
On that list you'll also see the plan for upcoming releases - I mention 
all this because we are actively working toward 3.2.1 which will include 
the JIRAs slated for that release (I'm sure you've seen).

If you can wait a bit you might be able to avoid some pain by using the 
upcoming 3.2.1 release. Once the patches land into that branch your 
issues will be resolved w/o you needing to manually apply patches, etc...


I did look at the files you attached - they look fine, so I'm not sure
what the issue is. The form of this test makes it harder - we are
verifying that the log contains sufficient information when a particular
error occurs. We fiddle with log4j in order to do this, which means that
the log you are including doesn't specify the problem.

Try instrumenting this test with a try/catch around the content of the 
test method (all the code in the failing method inside a big try/catch 
is what I mean). Then print the error to std out as part of the catch. 
That should shed some light. If you could debug it a bit that would help 
- because we aren't seeing this in our environment.
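
As a rough illustration of the try/catch instrumentation described
above, here is a minimal Java sketch. The class name, the TestBody
interface and runAndReport are hypothetical and not part of the
ZooKeeper test code; the body of the failing test method would go where
the lambda is. Note that a try/catch can only report a Throwable: if the
forked JVM exits outright, nothing will be printed.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class InstrumentedTestSketch {
    /** Hypothetical stand-in for the body of the failing test method. */
    interface TestBody {
        void run() throws Exception;
    }

    /**
     * Runs the body inside one big try/catch and prints any Throwable,
     * with its stack trace, to standard out. Returns the Throwable, or
     * null if the body completed normally. In a real test you would
     * rethrow after printing so the test still fails.
     */
    static Throwable runAndReport(TestBody body) {
        try {
            body.run();
            return null;
        } catch (Throwable t) {
            StringWriter sw = new StringWriter();
            t.printStackTrace(new PrintWriter(sw, true));
            System.out.println("Test body failed: " + sw);
            System.out.flush(); // make sure output is written promptly
            return t;
        }
    }

    public static void main(String[] args) {
        // Simulated failure; replace the lambda with the real test code.
        runAndReport(() -> {
            throw new IllegalStateException("simulated failure");
        });
    }
}
```

Printing to standard out sidesteps the log4j changes the test makes,
which is the reason the attached log does not show the error.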

Again, sort of a moot point if you can wait a week or so...

Regards,

Patrick

Todd Greenwood wrote:
> Inline.
> 
>> -----Original Message-----
>> From: Patrick Hunt [mailto:phunt@apache.org]
>> Sent: Thursday, July 30, 2009 10:57 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: Re: test failures in branch-3.2
>>
>> Todd Greenwood wrote:
>>> Starting w/ branch-3.2 (no changes) I applied patches in this order:
>>>
>>> 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest
> fails.
>>> 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
>>> PortAssignment.java.
>>>
>>> PortAssignment.java was added by Patrick as part of
> ZOOKEEPER-473.patch,
>>> which is a pretty hefty patch (> 2k lines) and touches a large
> number of
>>> files.
>> Hrm, those patches were probably created against the trunk. We'll have
>> to have separate patches for trunk and 3.2 branch on 481.
>>
>> If you could update the jira with this detail (481 needs two patches,
>> one for each branch) that would be great!
>>
> 
> Done.
> 
>>> 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails
> (jvm
>>> crashes).
>> 473 is "special" (unique) in the sense that it changes log4j while the
>> the vm is running. In general though it's a pretty boring test and
>> shouldn't be failing.
>>
>> Are you sure you have the right patch file? there are 2 patch files on
>> the JIRA for 473, make sure that you have the one from 7/16, NOT the
> one
>> from 7/15. Check that the patch file, the correct one should NOT
> contain
>> changes to build.xml or conf/log4j* files. If this still happens send
> me
>> your build.xml, conf/log4j* and QuroumPeerMainTest.java files in email
>> for review. I'll take a look.
>>
> 
> 
> I've annotated the files w/ their date while downloading:
> 112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
> 110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch
> 
> It appears I applied the 7-16 patch, as that is the matching file size
> of the patch file I applied.
> 
> If there are to be multiple patch files for multiple branches (3.2,
> trunk, etc.) would it make sense to lable the patch files accordingly?
> 
> Requested files in attached tar.
> 
> -Todd
> 
>> Patrick
>>
>>
>>> [junit] Running
> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>     [junit] Running
>>> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0
> sec
>>>     [junit] Test
> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>> FAILED (crashed)
>>>
>>> ------------
>>> Test Log
>>> ------------
>>> Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>> Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
>>>
>>> Testcase: testBadPeerAddressInQuorum took 0.004 sec
>>>     Caused an ERROR
>>> Forked Java VM exited abnormally. Please note the time in the report
>>> does not reflect the time until the VM exit.
>>> junit.framework.AssertionFailedError: Forked Java VM exited
> abnormally.
>>> Please note the time in the report does not reflect the time until
> the
>>> VM exit.
>>>
>>> -Todd
>>>
>>> -----Original Message-----
>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>> Sent: Thursday, July 30, 2009 10:13 PM
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: test failures in branch-3.2
>>>
>>> Todd Greenwood wrote:
>>>> ....
>>>> [Todd] Yes, I believe "address in use" was the problem w/ FLETest.
> I
>>>> assumed it was a timing issue w/ respect to test A not fully
> releasing
>>>> resources before test B started.
>>> Might be, but actually I think it's related to this:
>>> http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
>>>
>>> Patrick

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
Inline.

> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Thursday, July 30, 2009 10:57 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: test failures in branch-3.2
> 
> Todd Greenwood wrote:
> > Starting w/ branch-3.2 (no changes) I applied patches in this order:
> >
> > 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest
fails.
> > 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
> > PortAssignment.java.
> >
> > PortAssignment.java was added by Patrick as part of
ZOOKEEPER-473.patch,
> > which is a pretty hefty patch (> 2k lines) and touches a large
number of
> > files.
> 
> Hrm, those patches were probably created against the trunk. We'll have
> to have separate patches for trunk and 3.2 branch on 481.
> 
> If you could update the jira with this detail (481 needs two patches,
> one for each branch) that would be great!
> 

Done.

> > 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails
(jvm
> > crashes).
> 
> 473 is "special" (unique) in the sense that it changes log4j while the
> the vm is running. In general though it's a pretty boring test and
> shouldn't be failing.
> 
> Are you sure you have the right patch file? there are 2 patch files on
> the JIRA for 473, make sure that you have the one from 7/16, NOT the
one
> from 7/15. Check that the patch file, the correct one should NOT
contain
> changes to build.xml or conf/log4j* files. If this still happens send
me
> your build.xml, conf/log4j* and QuroumPeerMainTest.java files in email
> for review. I'll take a look.
> 


I've annotated the files w/ their date while downloading:
112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch

It appears I applied the 7-16 patch, as that is the matching file size
of the patch file I applied.

If there are to be multiple patch files for multiple branches (3.2,
trunk, etc.) would it make sense to label the patch files accordingly?

Requested files in attached tar.

-Todd

> Patrick
> 
> 
> > [junit] Running
org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >     [junit] Running
> > org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0
sec
> >     [junit] Test
org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> > FAILED (crashed)
> >
> > ------------
> > Test Log
> > ------------
> > Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> > Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
> >
> > Testcase: testBadPeerAddressInQuorum took 0.004 sec
> >     Caused an ERROR
> > Forked Java VM exited abnormally. Please note the time in the report
> > does not reflect the time until the VM exit.
> > junit.framework.AssertionFailedError: Forked Java VM exited
abnormally.
> > Please note the time in the report does not reflect the time until
the
> > VM exit.
> >
> > -Todd
> >
> > -----Original Message-----
> > From: Patrick Hunt [mailto:phunt@apache.org]
> > Sent: Thursday, July 30, 2009 10:13 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: test failures in branch-3.2
> >
> > Todd Greenwood wrote:
> >> ....
> >> [Todd] Yes, I believe "address in use" was the problem w/ FLETest.
I
> >> assumed it was a timing issue w/ respect to test A not fully
releasing
> >> resources before test B started.
> >
> > Might be, but actually I think it's related to this:
> > http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
> >
> > Patrick

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Todd Greenwood wrote:
> Starting w/ branch-3.2 (no changes) I applied patches in this order:
> 
> 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
> 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
> PortAssignment.java.
> 
> PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
> which is a pretty hefty patch (> 2k lines) and touches a large number of
> files. 

Hrm, those patches were probably created against the trunk. We'll have 
to have separate patches for trunk and 3.2 branch on 481.

If you could update the jira with this detail (481 needs two patches, 
one for each branch) that would be great!

> 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
> crashes).

473 is "special" (unique) in the sense that it changes log4j while the
vm is running. In general though it's a pretty boring test and
shouldn't be failing.

Are you sure you have the right patch file? There are 2 patch files on
the JIRA for 473; make sure that you have the one from 7/16, NOT the one
from 7/15. Check the patch file: the correct one should NOT contain
changes to build.xml or conf/log4j* files. If this still happens send me
your build.xml, conf/log4j* and QuorumPeerMainTest.java files in email
for review. I'll take a look.
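
One way to check which files a patch touches is to scan its "+++ "
header lines. The following is only a sketch under assumptions: the
class and method names are hypothetical, it assumes a unified diff as
produced by svn diff (path, then a tab, then a revision note), and the
"+++ " prefix match is a heuristic that could in principle also match an
added content line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PatchPathCheck {
    /** Collects the target paths named on "+++ " header lines. */
    static List<String> touchedPaths(List<String> patchLines) {
        List<String> paths = new ArrayList<>();
        for (String line : patchLines) {
            if (line.startsWith("+++ ")) {
                String rest = line.substring(4).trim();
                // Drop a trailing "\t(revision N)" or "\t(working copy)".
                int tab = rest.indexOf('\t');
                if (tab >= 0) {
                    rest = rest.substring(0, tab);
                }
                paths.add(rest);
            }
        }
        return paths;
    }

    /** True for the files the 7/16 patch should not contain. */
    static boolean isUnexpected(String path) {
        return path.equals("build.xml") || path.startsWith("conf/log4j");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PatchPathCheck <patch-file>");
            return;
        }
        // ISO-8859-1 accepts any byte sequence, so odd encodings won't throw.
        List<String> lines =
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        for (String path : touchedPaths(lines)) {
            if (isUnexpected(path)) {
                System.out.println("unexpected change: " + path);
            }
        }
    }
}
```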

Patrick


> [junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>     [junit] Running
> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
>     [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> FAILED (crashed)
> 
> ------------
> Test Log
> ------------
> Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec 
> 
> Testcase: testBadPeerAddressInQuorum took 0.004 sec 
>     Caused an ERROR
> Forked Java VM exited abnormally. Please note the time in the report
> does not reflect the time until the VM exit.
> junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
> Please note the time in the report does not reflect the time until the
> VM exit.
> 
> -Todd
> 
> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org] 
> Sent: Thursday, July 30, 2009 10:13 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: test failures in branch-3.2
> 
> Todd Greenwood wrote:
>> ....
>> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
>> assumed it was a timing issue w/ respect to test A not fully releasing
>> resources before test B started.
> 
> Might be, but actually I think it's related to this:
> http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
> 
> Patrick

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick/Flavio -

Starting w/ branch-3.2 (no changes) I applied patches in this order:

1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
PortAssignment.java.

PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
which is a pretty hefty patch (> 2k lines) and touches a large number of
files. 

3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
crashes).

[junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
    [junit] Running
org.apache.zookeeper.server.quorum.QuorumPeerMainTest
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
    [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
FAILED (crashed)

------------
Test Log
------------
Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec 

Testcase: testBadPeerAddressInQuorum took 0.004 sec 
    Caused an ERROR
Forked Java VM exited abnormally. Please note the time in the report
does not reflect the time until the VM exit.
junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
Please note the time in the report does not reflect the time until the
VM exit.

-Todd

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org] 
Sent: Thursday, July 30, 2009 10:13 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

Todd Greenwood wrote:
> ....
> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
> assumed it was a timing issue w/ respect to test A not fully releasing
> resources before test B started.

Might be, but actually I think it's related to this:
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html

Patrick

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Todd Greenwood wrote:
> ....
> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
> assumed it was a timing issue w/ respect to test A not fully releasing
> resources before test B started.

Might be, but actually I think it's related to this:
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
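
For reference, "address already in use" on a recently closed port is
commonly worked around by enabling SO_REUSEADDR before binding. A
minimal Java sketch follows; the class and method names are
hypothetical, and this is not a claim about how the ZooKeeper test
framework handles it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReuseAddressSketch {
    /** Binds a server socket with SO_REUSEADDR enabled. */
    static ServerSocket bindWithReuse(int port) throws IOException {
        ServerSocket ss = new ServerSocket(); // created unbound
        ss.setReuseAddress(true);             // must be set before bind()
        ss.bind(new InetSocketAddress(port));
        return ss;
    }

    /** Binds an ephemeral port and reports whether the option took effect. */
    static boolean probe() {
        try (ServerSocket ss = bindWithReuse(0)) {
            return ss.isBound() && ss.getReuseAddress();
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket ss = bindWithReuse(0)) {
            System.out.println("bound to port " + ss.getLocalPort()
                + ", reuseAddress=" + ss.getReuseAddress());
        }
    }
}
```

The option has to be set on an unbound socket; setting it after bind()
has no defined effect.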

Patrick

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick, inline.

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org] 
Sent: Thursday, July 30, 2009 9:13 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

Todd Greenwood wrote:
> The build succeeds, but not the all of the tests. In previous test
runs,
> I noticed an error in org.apache.zookeeper.test.FLETest. It was not
able
> to bind to a port or something. Now, after a machine reboot, I'm
getting
> different failures. 

"address in use"? That's a problem in the test framework pre-3.3. In 3.3
(current svn trunk) I fixed it but it's not in 3.2.x. This is a problem 
with the test framework though and not a real problem, it shows up 
occasionally (depends on timing).

[Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
assumed it was a timing issue w/ respect to test A not fully releasing
resources before test B started.

> branch-3.2 $ ant test
> 
> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> FAILED (crashed)
> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
> 
> Test logs for these two tests attached.

This is unusual though - looking at the log it seems that the JVM itself
crashed for the QPMainTest! For HQT we are seeing:

junit.framework.AssertionFailedError: Threads didn't join

which Flavio mentioned to me once is possible to happen but not a real 
problem (he can elaborate).

What version of java are you using? OS, other environment that might be 
interesting? (vm? etc...) You might try looking at the jvm crash dump 
file (I think it's in /tmp)

[Todd] ---------------------------
$ uname -a
Linux TODDG01LT 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC
2009 x86_64 GNU/Linux

$ which java
/home/toddg/bin/x64/java/jdk1.6.0_13/bin/java

$ java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

Memory = 4GB
[Todd] ---------------------------

If you run each of these two tests individually do they run? example:
ant -Dtestcase=FLENewEpochTest test-core-java

[Todd] Will try this once my local build is working and report back.
I'll open a separate mail thread on applying patches.

> My goal here is to get to a known state (all tests succeeding or have
> workarounds for the failures). Following that, I plan to apply the
> patches Flavio recommended for a WAN deploy (479 and 481). After I
> verify that the tests continue to run, I'll package this up and deploy
> it to our WAN for testing. 

Sounds like a good plan.

> So, are these known issues? Do the tests normally run en masse, or do
> some of the tests hold on to resources and prevent other tests from
> passing?

Typically they do run to completion, but occasionally on my machine
(java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some
random failure due to address in use, or the same "didn't join" that you
saw. Usually I see this if I'm multitasking (vs just letting the tests
run w/o using the box). As I said this is addressed in 3.3 (address
reuse at the very least, and I haven't seen the other issues).

Patrick



Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Todd Greenwood wrote:
> The build succeeds, but not the all of the tests. In previous test runs,
> I noticed an error in org.apache.zookeeper.test.FLETest. It was not able
> to bind to a port or something. Now, after a machine reboot, I'm getting
> different failures. 

"address in use"? That's a problem in the test framework pre-3.3. In 3.3 
(current svn trunk) I fixed it but it's not in 3.2.x. This is a problem 
with the test framework though and not a real problem, it shows up 
occasionally (depends on timing).

> branch-3.2 $ ant test
> 
> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> FAILED (crashed)
> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
> 
> Test logs for these two tests attached.

This is unusual though - looking at the log it seems that the JVM itself
crashed for the QPMainTest! For HQT we are seeing:

junit.framework.AssertionFailedError: Threads didn't join

which Flavio once mentioned to me can happen but is not a real
problem (he can elaborate).
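
To illustrate the kind of check that can produce a "didn't join"
failure, here is a minimal Java sketch that joins worker threads with a
timeout and counts the ones still running. It is an assumption about the
general pattern, with hypothetical names, not the actual
HierarchicalQuorumTest code.

```java
import java.util.Arrays;
import java.util.List;

public class JoinWithTimeoutSketch {
    /**
     * Waits up to timeoutMs for each thread in turn and returns how many
     * are still alive afterwards. timeoutMs should be greater than zero,
     * since Thread.join(0) waits forever.
     */
    static int countStillAlive(List<Thread> threads, long timeoutMs) {
        int alive = 0;
        for (Thread t : threads) {
            try {
                t.join(timeoutMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt status for the caller.
                Thread.currentThread().interrupt();
            }
            if (t.isAlive()) {
                alive++;
            }
        }
        return alive;
    }

    public static void main(String[] args) {
        Thread quick = new Thread(() -> { });
        quick.start();
        int alive = countStillAlive(Arrays.asList(quick), 2000);
        System.out.println("threads still alive: " + alive);
    }
}
```

On a loaded machine a thread can simply need longer than the timeout,
which would be consistent with the failure showing up only occasionally.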

What version of java are you using? OS, other environment that might be 
interesting? (vm? etc...) You might try looking at the jvm crash dump 
file (I think it's in /tmp)

If you run each of these two tests individually do they run? example:
ant -Dtestcase=FLENewEpochTest test-core-java

> My goal here is to get to a known state (all tests succeeding or have
> workarounds for the failures). Following that, I plan to apply the
> patches Flavio recommended for a WAN deploy (479 and 481). After I
> verify that the tests continue to run, I'll package this up and deploy
> it to our WAN for testing. 

Sounds like a good plan.

> So, are these known issues? Do the tests normally run en masse, or do
> some of the tests hold on to resources and prevent other tests from
> passing?

Typically they do run to completion, but occasionally on my machine 
(java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
random failure due to address in use, or the same "didn't join" that you 
saw. Usually I see this if I'm multitasking (vs just letting the tests 
run w/o using the box). As I said this is addressed in 3.3 (address 
reuse at the very least, and I haven't seen the other issues).

Patrick



Re: test failures in branch-3.2

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Todd,

On Jul 30, 2009, at 5:08 PM, Todd Greenwood wrote:

> The build succeeds, but not the all of the tests. In previous test  
> runs,
> I noticed an error in org.apache.zookeeper.test.FLETest. It was not  
> able
> to bind to a port or something. Now, after a machine reboot, I'm  
> getting
> different failures.
>

This issue might be fixed in trunk, but not in the 3.2 distribution.

> branch-3.2 $ ant test
>
> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> FAILED (crashed)
> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
>

HierarchicalQuorumTest is supposed to fail until you apply the patches  
I mentioned. I don't know what could have caused the crash of the jvm  
in the other one.

-Flavio