Posted to user@zookeeper.apache.org by Mahadev Konar <ma...@yahoo-inc.com> on 2009/07/23 23:39:42 UTC

Bug in 3.2 release.


Hi folks,
 We just discovered a bug in the 3.2 release:

http://issues.apache.org/jira/browse/ZOOKEEPER-484

This bug affects your clients whenever they switch from a zookeeper server
that is a follower to a server that is the leader. We should have a fix in
3.2.1 and trunk by next week; 3.2.1 itself should be released in the next
2-3 weeks.

If you are already using 3.2.0 in production, I would suggest switching
back to 3.1.1. There is a workaround mentioned in the jira
(http://issues.apache.org/jira/browse/ZOOKEEPER-484), but I would advise
against it.

The 3.2.0 clients are compatible with 3.1.1 servers.

Thanks 
mahadev


------ End of Forwarded Message


Re: Zookeeper WAN Configuration

Posted by Benjamin Reed <br...@yahoo-inc.com>.
the processing of the write transaction is described in the zookeeper
internals presentation on
http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations. i think
other presentations may also touch on it. we also have it in the
ZooKeeper documentation:
http://hadoop.apache.org/zookeeper/docs/r3.2.0/zookeeperInternals.html

ben


Todd Greenwood wrote:
> Flavio & Ted, thank you for your comments.
>
> So it sounds like the only way to currently deploy to the WAN is to
> deploy ZK Servers to the central DC and open up client connections to
> these ZK servers from the edge nodes. True?
>
> In the future, once the Observers feature is implemented, then we should
> be able to deploy zk servers to both the DC and to the pods...with all
> the goodness that Flavio mentions below.
>
> Flavio - do you have a doc that describes exactly what happens in the
> transaction of a write operation? For instance, I'd like to know at
> exactly what stage a write has been committed to the ensemble, and not
> just the zk server the client is connected to. I figure it must be
> something like:
>
> clientA.write(path, value)
> -> serverA writes to memory
> -> serverA writes to transacted disk every n/seconds or m/bytes
> -> serverA sends write to Leader
> -> Leader stamps with transaction id
> -> Leader responds to ensemble with update + transaction id
>
> -Todd
>
> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com] 
> Sent: Friday, July 24, 2009 4:50 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Zookeeper WAN Configuration
>
> Just a few quick observations:
>
> On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:
>
>   
>> On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
>> <to...@audiencescience.com>wrote:
>>
>>     
>>> Could you explain the idea behind the Observers feature, what this
>>> concept is supposed to address, and how it applies to the WAN
>>> configuration problem in particular?
>>>
>>>       
>> Not really.  I am just echoing comments on observers from them that  
>> know.
>>
>>     
>
> Without observers, increasing the number of servers in an ensemble
> enables higher read throughput, but causes write throughput to drop
> because the number of votes needed to order each write operation increases.
> Essentially, observers are zookeeper servers that don't vote when
> ordering updates to the zookeeper state. Adding observers enables
> higher read throughput with minimal effect on write throughput (the leader
> still has to send commits to everyone, at least in the version we have
> been working on).
>
>   
>>> """
>>> The ideas for federating ZK or allowing observers would likely do what
>>> you want.  I can imagine that an observer would only care that it can see
>>> its local peers and one of the observers would be elected to get updates
>>> (and thus would care about the central service).
>>> """
>>> This certainly sounds like exactly what I want...Was this introduced in
>>> 3.2 in full, or only partially?
>>>
>>>       
>> I don't think it is even in trunk yet.  Look on Jira or at the recent logs
>> of this mailing list.
>>     
>
> It is not on trunk yet.
>
> -Flavio
>
>   
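
A rough illustration of Flavio's point above, assuming plain majority
quorums and ignoring weights and groups: the number of acknowledgements the
leader needs per write grows with the number of voting servers, while
observers leave it unchanged. The function names are made up for this
sketch and are not ZooKeeper APIs.

```python
def majority_quorum(voters: int) -> int:
    """Smallest number of votes that is a strict majority of the voters."""
    if voters < 1:
        raise ValueError("an ensemble needs at least one voting server")
    return voters // 2 + 1


def acks_needed(voters: int, observers: int = 0) -> int:
    """Votes needed to order a write; observers never vote, so they are ignored."""
    return majority_quorum(voters)


# (voters, observers) pairs: growing the voting set raises the quorum,
# adding observers does not.
for voters, observers in [(3, 0), (5, 0), (9, 0), (5, 4)]:
    total = voters + observers
    print(total, "servers,", voters, "voting ->", acks_needed(voters, observers), "votes per write")
```

Under these assumptions a 9-server ensemble needs 5 votes per write when all
servers vote, but only 3 when it is built from 5 voters plus 4 observers.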


Re: Zookeeper WAN Configuration

Posted by Ted Dunning <te...@gmail.com>.
See here:

http://issues.apache.org/jira/browse/ZOOKEEPER-29 (in trunk since April,
released in 3.2.0)
http://issues.apache.org/jira/browse/ZOOKEEPER-368 (not in yet)

On Mon, Jul 27, 2009 at 4:02 PM, Todd Greenwood
<to...@audiencescience.com>wrote:

> [Todd] Great, we'll proceed with hierarchical configuration w/ ZK Servers
> in pods having a voting weight of zero. Could you provide a pointer to a
> configuration that shows this? The docs are a bit lean in this regard...
>



-- 
Ted Dunning, CTO
DeepDyve
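
For readers looking for the kind of configuration Todd asks about, here is
a small sketch that prints the group and weight lines for a hierarchical
setup in which one group of servers has a voting weight of zero. It only
relies on the group.N and weight.N keys that appear in the configuration
posted later in this thread; the helper name and the server ids are
illustrative assumptions, not part of ZooKeeper.

```python
def render_quorum_config(groups, zero_weight_groups=()):
    """Return zoo.cfg lines for the given groups.

    groups: dict mapping a group id to the list of server ids in that group.
    zero_weight_groups: group ids whose servers get weight 0 (non-voting).
    """
    lines = []
    for gid in sorted(groups):
        sids = groups[gid]
        lines.append("group.%d=%s" % (gid, ":".join(str(s) for s in sids)))
        weight = 0 if gid in zero_weight_groups else 1
        for sid in sids:
            lines.append("weight.%d=%d" % (sid, weight))
    return "\n".join(lines)


# Hypothetical layout: servers 1-5 in the central DC, servers 6-9 in the pods.
example = render_quorum_config({1: [1, 2, 3, 4, 5], 2: [6, 7, 8, 9]},
                               zero_weight_groups={2})
print(example)
```

The server.N host lines still have to be written by hand; this sketch only
covers the grouping and weighting part.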

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
well try running these two tests individually and see if they always
fail or just occasionally. that will be a good start (and the env detail).

Patrick
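
One way to tell "always fails" from "fails occasionally" is to repeat a
single test several times and count the failures. The sketch below is only
an illustration: it assumes the build exits with a non-zero status when the
test fails, and the helper name is made up.

```python
import subprocess


def count_failures(command, runs=5):
    """Run the command `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(command,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures
```

For example, count_failures(["ant", "-Dtestcase=QuorumPeerMainTest",
"test-core-java"], runs=10) would return 10 for a test that always fails
and a smaller number for a timing-dependent one.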

Todd Greenwood wrote:
> No edits to conf/log4j.properties.
> 
> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org] 
> Sent: Thursday, July 30, 2009 9:25 PM
> To: Patrick Hunt
> Cc: zookeeper-user@hadoop.apache.org
> Subject: Re: test failures in branch-3.2
> 
> btw QuorumPeerMainTest uses the CONSOLE appender which is setup in 
> conf/log4j.properties, now that I think of it perhaps not such a good 
> idea :-)
> 
> If you edited conf/log4j.properties it may be causing the test to fail,
> did you do this? (if you run the test by itself using -Dtestcase does it
> always fail?)
> 
> I've entered a jira to address this:
> https://issues.apache.org/jira/browse/ZOOKEEPER-492
> 
> Patrick
> 
> Patrick Hunt wrote:
>> Todd Greenwood wrote:
>>> The build succeeds, but not all of the tests. In previous test runs,
>>> I noticed an error in org.apache.zookeeper.test.FLETest. It was not able
>>> to bind to a port or something. Now, after a machine reboot, I'm getting
>>> different failures.
>> "address in use"? That's a problem in the test framework pre-3.3. In 3.3
>> (current svn trunk) I fixed it but it's not in 3.2.x. This is a problem
>> with the test framework though and not a real problem, it shows up
>> occasionally (depends on timing).
>>
>>> branch-3.2 $ ant test
>>>
>>> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>> FAILED (crashed)
>>> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
>>>
>>> Test logs for these two tests attached.
>> This is unusual though - looking at the log it seems that the JVM itself
>> crashed for the QPMainTest! for HQT we are seeing:
>>
>> junit.framework.AssertionFailedError: Threads didn't join
>>
>> which Flavio mentioned to me once is possible to happen but not a real
>> problem (he can elaborate).
>>
>> What version of java are you using? OS, other environment that might be
>> interesting? (vm? etc...) You might try looking at the jvm crash dump
>> file (I think it's in /tmp)
>>
>> If you run each of these two tests individually do they run? example:
>> ant -Dtestcase=FLENewEpochTest test-core-java
>>
>>> My goal here is to get to a known state (all tests succeeding or have
>>> workarounds for the failures). Following that, I plan to apply the
>>> patches Flavio recommended for a WAN deploy (479 and 481). After I
>>> verify that the tests continue to run, I'll package this up and deploy
>>> it to our WAN for testing.
>> Sounds like a good plan.
>>
>>> So, are these known issues? Do the tests normally run en masse, or do
>>> some of the tests hold on to resources and prevent other tests from
>>> passing?
>> Typically they do run to completion, but occasionally on my machine 
>> (java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
>> random failure due to address in use, or the same "didn't join" that you
>> saw. Usually I see this if I'm multitasking (vs just letting the tests
>> run w/o using the box). As I said this is addressed in 3.3 (address
>> reuse at the very least, and I haven't seen the other issues).
>>
>> Patrick
>>
>>

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
No edits to conf/log4j.properties.

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org] 
Sent: Thursday, July 30, 2009 9:25 PM
To: Patrick Hunt
Cc: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

btw QuorumPeerMainTest uses the CONSOLE appender which is setup in 
conf/log4j.properties, now that I think of it perhaps not such a good 
idea :-)

If you edited conf/log4j.properties it may be causing the test to fail,
did you do this? (if you run the test by itself using -Dtestcase does it
always fail?)

I've entered a jira to address this:
https://issues.apache.org/jira/browse/ZOOKEEPER-492

Patrick

Patrick Hunt wrote:
> Todd Greenwood wrote:
>> The build succeeds, but not all of the tests. In previous test runs,
>> I noticed an error in org.apache.zookeeper.test.FLETest. It was not able
>> to bind to a port or something. Now, after a machine reboot, I'm getting
>> different failures.
>
> "address in use"? That's a problem in the test framework pre-3.3. In 3.3
> (current svn trunk) I fixed it but it's not in 3.2.x. This is a problem
> with the test framework though and not a real problem, it shows up
> occasionally (depends on timing).
> 
>> branch-3.2 $ ant test
>>
>> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>> FAILED (crashed)
>> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
>>
>> Test logs for these two tests attached.
> 
> This is unusual though - looking at the log it seems that the JVM itself
> crashed for the QPMainTest! for HQT we are seeing:
>
> junit.framework.AssertionFailedError: Threads didn't join
>
> which Flavio mentioned to me once is possible to happen but not a real
> problem (he can elaborate).
>
> What version of java are you using? OS, other environment that might be
> interesting? (vm? etc...) You might try looking at the jvm crash dump
> file (I think it's in /tmp)
> 
> If you run each of these two tests individually do they run? example:
> ant -Dtestcase=FLENewEpochTest test-core-java
> 
>> My goal here is to get to a known state (all tests succeeding or have
>> workarounds for the failures). Following that, I plan to apply the
>> patches Flavio recommended for a WAN deploy (479 and 481). After I
>> verify that the tests continue to run, I'll package this up and deploy
>> it to our WAN for testing.
> 
> Sounds like a good plan.
> 
>> So, are these known issues? Do the tests normally run en masse, or do
>> some of the tests hold on to resources and prevent other tests from
>> passing?
> 
> Typically they do run to completion, but occasionally on my machine 
> (java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
> random failure due to address in use, or the same "didn't join" that you
> saw. Usually I see this if I'm multitasking (vs just letting the tests
> run w/o using the box). As I said this is addressed in 3.3 (address
> reuse at the very least, and I haven't seen the other issues).
> 
> Patrick
> 
> 

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
btw QuorumPeerMainTest uses the CONSOLE appender which is setup in 
conf/log4j.properties, now that I think of it perhaps not such a good 
idea :-)

If you edited conf/log4j.properties it may be causing the test to fail,
did you do this? (if you run the test by itself using -Dtestcase does it 
always fail?)

I've entered a jira to address this:
https://issues.apache.org/jira/browse/ZOOKEEPER-492

Patrick

Patrick Hunt wrote:
> Todd Greenwood wrote:
>> The build succeeds, but not all of the tests. In previous test runs,
>> I noticed an error in org.apache.zookeeper.test.FLETest. It was not able
>> to bind to a port or something. Now, after a machine reboot, I'm getting
>> different failures. 
> 
> "address in use"? That's a problem in the test framework pre-3.3. In 3.3 
> (current svn trunk) I fixed it but it's not in 3.2.x. This is a problem 
> with the test framework though and not a real problem, it shows up 
> occasionally (depends on timing).
> 
>> branch-3.2 $ ant test
>>
>> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>> FAILED (crashed)
>> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
>>
>> Test logs for these two tests attached.
> 
> This is unusual though - looking at the log it seems that the JVM itself 
> crashed for the QPMainTest! for HQT we are seeing:
> 
> junit.framework.AssertionFailedError: Threads didn't join
> 
> which Flavio mentioned to me once is possible to happen but not a real 
> problem (he can elaborate).
> 
> What version of java are you using? OS, other environment that might be 
> interesting? (vm? etc...) You might try looking at the jvm crash dump 
> file (I think it's in /tmp)
> 
> If you run each of these two tests individually do they run? example:
> ant -Dtestcase=FLENewEpochTest test-core-java
> 
>> My goal here is to get to a known state (all tests succeeding or have
>> workarounds for the failures). Following that, I plan to apply the
>> patches Flavio recommended for a WAN deploy (479 and 481). After I
>> verify that the tests continue to run, I'll package this up and deploy
>> it to our WAN for testing. 
> 
> Sounds like a good plan.
> 
>> So, are these known issues? Do the tests normally run en masse, or do
>> some of the tests hold on to resources and prevent other tests from
>> passing?
> 
> Typically they do run to completion, but occasionally on my machine 
> (java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
> random failure due to address in use, or the same "didn't join" that you 
> saw. Usually I see this if I'm multitasking (vs just letting the tests 
> run w/o using the box). As I said this is addressed in 3.3 (address 
> reuse at the very least, and I haven't seen the other issues).
> 
> Patrick
> 
> 

RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
IT says yes, there are firewalls, but that yes, there is full
connectivity between each of the zk servers.
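
A quick way to double-check the connectivity question from the quoted
message below is to try a TCP connection to the quorum and election ports
(2888 and 3888 in this thread) from each server to every other server. The
following is a minimal sketch using only the standard socket module; the
function names are hypothetical, and a successful connect only shows that
the port is reachable at that moment, not that the ensemble is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_peers(hosts, ports=(2888, 3888), timeout=3.0):
    """Map every (host, port) pair to whether it accepted a connection."""
    return {(host, port): can_connect(host, port, timeout)
            for host in hosts
            for port in ports}
```

Running check_peers with the full host list on each of the nine machines in
turn would show which direction, if any, is being blocked. Note that the
election port is only open while the server process is running.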

> -----Original Message-----
> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> Sent: Tuesday, August 04, 2009 6:01 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Hi todd,
>   I see a lot of
> 
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.Net.connect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>         at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
>         at java.lang.Thread.run(Thread.java:619)
> 
> 
> Is it possible that there is some firewall? Can all the servers 1-9 connect
> to all the others using ports that you specified in zoo.cfg i.e 2888/3888?
> 
> 
> Thanks
> mahadev
> 
> 
> On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
> > Looks like we're not getting *any* leader elected now.... Logs attached.
> >
> >> -----Original Message-----
> >> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >> Sent: Tuesday, August 04, 2009 4:07 PM
> >> To: zookeeper-dev@hadoop.apache.org
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Patrick, thanks! I'll forward on to IT and I'll report back to you
> >> shortly...
> >>
> >>> -----Original Message-----
> >>> From: Patrick Hunt [mailto:phunt@apache.org]
> >>> Sent: Tuesday, August 04, 2009 3:55 PM
> >>> To: zookeeper-dev@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Todd, Mahadev and I looked at this and it turns out to be a regression.
> >>> Ironically a patch I created for 3.2 branch to add quorum tests actually
> >>> broke the quorum config -- a default value for a config parameter was
> >>> lost. I'm going to submit a patch asap to get the default back, but for
> >>> the time being you can set:
> >>>
> >>> electionAlg=3
> >>>
> >>> in each of your config files.
> >>>
> >>> You should see reference to FastLeaderElection in your log files if this
> >>> parameter is set correctly.
> >>>
> >>> Sorry for the trouble,
> >>>
> >>> Patrick
> >>>
> >>> Todd Greenwood wrote:
> >>>> Mahadev,
> >>>>
> >>>> I just heard from IT that this build behaves in exactly the same way as
> >>>> previous versions, e.g. we get continuous leader elections that
> >>>> disconnect the followers and then get re-elected, and disconnect...etc.
> >>>>
> >>>> This is from a fresh sync to the 3.2 branch:
> >>>>
> >>>> svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
> >>>>
> >>>> CHANGES.txt shows the various fixes included:
> >>>>
> >>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> >>>> Release 3.2.1
> >>>>
> >>>> Backward compatibile changes:
> >>>>
> >>>> BUGFIXES:
> >>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
> >>>>
> >>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
> >>>>
> >>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> >>>>
> >>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
> >>>>
> >>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
> >>>>   (giri via mahadev)
> >>>>
> >>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> >>>>
> >>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
> >>>>   failure. (chris via mahadev)
> >>>>
> >>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
> >>>>
> >>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
> >>>>   embedded clients (ryan rawson via phunt)
> >>>>
> >>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
> >>>>
> >>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
> >>>>   (flavio via mahadev)
> >>>>
> >>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
> >>>>   (Chris Darroch via phunt)
> >>>>
> >>>>   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
> >>>>   add vote of follower (flavio via mahadev)
> >>>>
> >>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
> >>>>   mahadev)
> >>>>
> >>>> What can I do to assist you with this issue?
> >>>>
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>> Sent: Tuesday, August 04, 2009 12:43 PM
> >>>>> To: zookeeper-dev@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Hi todd,
> >>>>>  comments in line
> >>>>>
> >>>>>
> >>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>>>> Mahadev,
> >>>>>>
> >>>>>> Some quick questions:
> >>>>>>
> >>>>>> 1. Version
> >>>>>>
> >>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> >>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
> >>>>>> this release 3.2.1?
> >>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
> >>>>> the release.
> >>>>>
> >>>>>> 2. Build targets
> >>>>>>
> >>>>>> The package target fails b/c the create-cppunit-configure target fails
> >>>>>> due to various problems w/ respect to autoconf. Are these dependencies
> >>>>>> documented somewhere ? I'd like to have a fully building system.
> >>>>>>
> >>>>>> create-cppunit-configure:
> >>>>>>      [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
> >>>>>>      [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
> >>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
> >>>>>>      [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
> >>>>>>      [exec]       If this token and others are legitimate, please use m4_pattern_allow.
> >>>>>>      [exec]       See the Autoconf documentation.
> >>>>>>      [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
> >>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >>>>>>
> >>>>> You need auto tools to run this. Please read the README for building c
> >>>>> client library at src/c/ for the installation requirements.
> >>>>>> 3. Sync failure:
> >>>>>>
> >>>>>> This is still failing.
> >>>>>>
> >>>>>> svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> >>>>>> doesn't exist
> >>>>>>
> >>>>> Yes this hasn't been fixed yet!
> >>>>>
> >>>>> Thanks
> >>>>> mahadev
> >>>>>> -Todd
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Todd Greenwood
> >>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
> >>>>>>> To: 'zookeeper-user@hadoop.apache.org'
> >>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Great news. Thank you Mahadev. I'll report our findings later today.
> >>>>>>> -Todd
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> Hi Todd,
> >>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch now.
> >>>>>>>> Thanks
> >>>>>>>> mahadev
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>>>>>>> That'd be perfect. Thanks!
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> Hi Todd,
> >>>>>>>>>>   Most of the patches that you mention should be in the branch 3.2 by
> >>>>>>>>>> tomorrow or so. 481, 479 are already in. 480 and 491 should be in by
> >>>>>>>>>> tomorrow. Would that suffice for you?
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> mahadev
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>>>>>>>>> Another problem...I've reverted to the latest versions of the patches
> >>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two compilation
> >>>>>>>>>>> errors:
> >>>>>>>>>>>
> >>>>>>>>>>> build-generated:
> >>>>>>>>>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>
> >>>>>>>>>>> compile-main:
> >>>>>>>>>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
> >>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
> >>>>>>>>>>>     [javac]                         ^
> >>>>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
> >>>>>>>>>>>     [javac]         public String getServerState();
> >>>>>>>>>>>     [javac]                       ^
> >>>>>>>>>>>     [javac] 2 errors
> >>>>>>>>>>>
> >>>>>>>>>>> My build process is pretty simple:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>>>>>>> (src/patched/branch-3.2)
> >>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>>>>>>> 3. build zookeeper in the temp directory
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Flavio,
> >>>>>>>>>>>> I notice that you've updated the patches referenced for the WAN
> >>>>>>>>>>>> deployment. There appears to be an order dependency w/ respect to
> >>>>>>>>>>>> these four patches...
> >>>>>>>>>>>>
> >>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>>>>>>>>>>>
> >>>>>>>>>>>> 473 -> 479 (479 fails)
> >>>>>>>>>>>>
> >>>>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> >>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> >>>>>>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>>>>>>> Hunk #1 FAILED at 93.
> >>>>>>>>>>>> Hunk #2 FAILED at 145.
> >>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> >>>>>>>>>>>>
> >>>>>>>>>>>> Could you advise as to which patches I need to apply, and in what
> >>>>>>>>>>>> order?
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
> >>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches
> >>>>>>>>>>>> 473, 479, 481, and 491.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Flavio,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>>
> >>>>>>>>>>>> compile-main:
> >>>>>>>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
> >>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>   [javac] ^
> >>>>>>>>>>>>   [javac] 1 error
> >>>>>>>>>>>>
> >>>>>>>>>>>> I see a reference to getWeight in both patch 491 and
> >>>>>>>>>>>> FastLeaderElection.java:
> >>>>>>>>>>>>
> >>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>
> >>>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479,
> >>>>>>>>>>>> or 481. I also don't see a reference to this method in the trunk...
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Todd Greenwood
> > [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> >>>>>>>>>>>> 481).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
> >>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
> >>>>>>>>>>>> then disconnects everyone. Then they re-elect it again, and it loops
> >>>>>>>>>>>> over and over.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>> Server config
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>>
> >>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
> >>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> >>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines in
> >>>>>>>>>>>> the pods all have a weight of zero, and are not expected to become
> >>>>>>>>>>>> leaders, or to vote on transactions.
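
The intent of the weights in the config above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not ZooKeeper's QuorumHierarchical code: the class WeightSketch and its methods mayLead and containsQuorum are hypothetical names, the server ids and weights are copied from the config above, and the quorum rule is simplified to a plain majority of total weight (the group structure is ignored).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: servers 1-5 (dc1) have weight 1, servers 6-9 (pods) have weight 0.
class WeightSketch {
    private final Map<Long, Long> weights = new HashMap<>();

    WeightSketch() {
        for (long sid = 1; sid <= 5; sid++) weights.put(sid, 1L);
        for (long sid = 6; sid <= 9; sid++) weights.put(sid, 0L);
    }

    // Weight of a server id; unknown ids are treated as weight 0.
    long getWeight(long sid) {
        return weights.getOrDefault(sid, 0L);
    }

    // A zero-weight server should never be accepted as leader.
    boolean mayLead(long sid) {
        return getWeight(sid) != 0;
    }

    // Simplified rule (assumption): acks form a quorum when they hold
    // strictly more than half of the total weight.
    boolean containsQuorum(Set<Long> acks) {
        long total = 0;
        long acked = 0;
        for (Map.Entry<Long, Long> e : weights.entrySet()) {
            total += e.getValue();
            if (acks.contains(e.getKey())) acked += e.getValue();
        }
        return 2 * acked > total;
    }

    public static void main(String[] args) {
        WeightSketch w = new WeightSketch();
        System.out.println("server 9 may lead: " + w.mayLead(9));
        Set<Long> pods = new HashSet<>(Arrays.asList(6L, 7L, 8L, 9L));
        System.out.println("pods alone form a quorum: " + w.containsQuorum(pods));
    }
}
```

Under this simplified rule the pod servers contribute nothing to a quorum, which is the behavior the config is meant to express.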
> >>>>>>>>>>>>
> >>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>


Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
 Can you attach the files to the jira? I will take a look at this and will
get back to you by end of day today.

Thanks
mahadev


On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Looks like we're not getting *any* leader elected now.... Logs attached.
> 
>> -----Original Message-----
>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>> Sent: Tuesday, August 04, 2009 4:07 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>> 
>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>> shortly...
>> 
>>> -----Original Message-----
>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>> 
>>> Todd, Mahadev and I looked at this and it turns out to be a regression.
>>> Ironically a patch I created for 3.2 branch to add quorum tests actually
>>> broke the quorum config -- a default value for a config parameter was
>>> lost. I'm going to submit a patch asap to get the default back, but for
>>> the time being you can set:
>>> 
>>> electionAlg=3
>>> 
>>> in each of your config files.
>>> 
>>> You should see reference to FastLeaderElection in your log files if this
>>> parameter is set correctly.
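
One way to double-check that the workaround is present in a config file is to load its text and look at the key. The sketch below is an assumption-laden illustration: it treats the config as java.util.Properties text, the class ElectionAlgCheck and its method are hypothetical names, and it only checks that the key is set to 3, not that the server actually picked it up (the log files remain the real confirmation).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: checks whether config text sets electionAlg=3.
class ElectionAlgCheck {
    static boolean usesElectionAlg3(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        String alg = props.getProperty("electionAlg");
        return alg != null && alg.trim().equals("3");
    }

    public static void main(String[] args) {
        // Sample text only; a real check would read each server's config file.
        String sample = "server.1=dc1-zook01.dc01.revsci.net:2888:3888\nelectionAlg=3\n";
        System.out.println("electionAlg=3 present: " + usesElectionAlg3(sample));
    }
}
```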
>>> 
>>> Sorry for the trouble,
>>> 
>>> Patrick
>>> 
>>> Todd Greenwood wrote:
>>>> Mahadev,
>>>> 
>>>> I just heard from IT that this build behaves in exactly the same way as
>>>> previous versions, e.g. we get continuous leader elections that
>>>> disconnect the followers and then get re-elected, and disconnect...etc.
>>>> 
>>>> This is from a fresh sync to the 3.2 branch:
>>>> 
>>>> svn co
>>>> 
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
>>>> ./branch-3.2
>>>> 
>>>> CHANGES.TXT show the various fixes included:
>>>> 
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>> /src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>> Release 3.2.1
>>>> 
>>>> Backward compatibile changes:
>>>> 
>>>> BUGFIXES:
>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
>> via
>>>> flavio)
>>>> 
>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
>> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
> mahadev)
>>>> 
>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>   (giri via mahadev)
>>>> 
>>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
>> mahadev)
>>>> 
>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
>> immediate
>>>>   failure. (chris via mahadev)
>>>> 
>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
>> via
>>>> phunt)
>>>> 
>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>> other)
>>>>   embedded clients (ryan rawson via phunt)
>>>> 
>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
>> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups
> correctly
>>>>   (flavio via mahadev)
>>>> 
>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
>> empty
>>>> cert
>>>>   (Chris Darroch via phunt)
>>>> 
>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>> leading and
>>>>   add vote of follower (flavio via mahadev)
>>>> 
>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
>> (flavio
>>>> via
>>>>   mahadev)
>>>> 
>>>> What can I do to assist you with this issue?
>>>> 
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Hi todd,
>>>>>  comments in line
>>>>> 
>>>>> 
>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>> Mahadev,
>>>>>> 
>>>>>> Some quick questions:
>>>>>> 
>>>>>> 1. Version
>>>>>> 
>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
> is
>>>> still
>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in
>>>> calling
>>>>>> this release 3.2.1?
>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as
> we
>>>> tag
>>>>> the
>>>>> release.
>>>>> 
>>>>>> 2. Build targets
>>>>>> 
>>>>>> The package target fails b/c the create-cppunit-configure target
>>>> fails
>>>>>> due to various problems w/ respect to autoconf. Are these
>>>> dependencies
>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>> 
>>>>>> create-cppunit-configure:
>>>>>>      [exec] Can't exec "libtoolize": No such file or directory
> at
>>>>>> /usr/bin/autoreconf line 188.
>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern
>> match
>>>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
> not
>>>> found
>>>>>> in library
>>>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>>>> AM_PATH_CPPUNIT
>>>>>>      [exec]       If this token and others are legitimate,
> please
>>>> use
>>>>>> m4_pattern_allow.
>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>>>> AC_PROG_LIBTOOL
>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
> status:
>> 1
>>>>>> 
>>>>> You need auto tools to run this. Please read the README for
>> building c
>>>>> client library at src/c/ for the installation requirements.
>>>>>> 3. Sync failure:
>>>>>> 
>>>>>> This is still failing.
>>>>>> 
>>>>>> svn: URL
>>>>>> 
> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>>>> doesn't exist
>>>>>> 
>>>>> Yes this hasn't been fixed yet!
>>>>> 
>>>>> Thanks
>>>>> mahadev
>>>>>> -Todd
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Todd Greenwood
>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Great news. Thank you Mahadev. I'll report our findings later
>>>> today.
>>>>>>> -Todd
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>> 
>>>>>>>> Hi Todd,
>>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
>>>> now.
>>>>>>>> Thanks
>>>>>>>> mahadev
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
> <to...@audiencescience.com>
>>>>>> wrote:
>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> Hi Todd,
>>>>>>>>>>   Most of the patches that you mention should be in the
> branch
>>>>>> 3.2 by
>>>>>>>>> tomm
>>>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>>>> tomm.
>>>>>>>>> Would
>>>>>>>>>> that
>>>>>>>>>> suffice for you?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> mahadev
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
>> <to...@audiencescience.com>
>>>>>>> wrote:
>>>>>>>>>>> Another problem...I've reverted to the latest versions of
> the
>>>>>>>>> patches
>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>>>> compilation
>>>>>>>>>>> errors:
>>>>>>>>>>> 
>>>>>>>>>>> build-generated:
>>>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>> 
>>>>>>>>>>> compile-main:
>>>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>     [javac]
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-
>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>>>> getQuorumPeers()
>>>>>>>>> have
>>>>>>>>>>> the same erasure
>>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>>>     [javac]                         ^
>>>>>>>>>>>     [javac]
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-
>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>>>> getServerState()
>>>>>>>>> have
>>>>>>>>>>> the same erasure
>>>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>>>     [javac]                       ^
>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>> 
>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>> 
>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> I notice that you've updated the patches referenced for
> the
>>>> WAN
>>>>>>>>>>>> deployment. There appears to be an order dependency w/
>> respect
>>>>>> to
>>>>>>>>>>> these
>>>>>>>>>>>> four patches...
>>>>>>>>>>>> 
>>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>>> 
>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>>>> ical.java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>>>> .java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>>>> 
>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>> 
>>>>>>>>>>>> Could you advise as to which patches I need to apply, and
> in
>>>>>> what
>>>>>>>>>>> order?
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>>>> Compilation
>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
> latest
>>>>>> patches
>>>>>>>>>>>> 473,
>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
> of
>>>> the
>>>>>>>>>>> patch.
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>> 
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> 
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> 
>>>>>> 
>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastL
>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>> 
>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>   [javac]
>>>>>> ^
>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>> 
>>>>>>>>>>>> I see a reference to getWeight in both
>>>>>> FastLeaderElection.java
>>>>>>>>>>> in
>>>>>>>>>>>> patch
>>>>>>>>>>>> 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>> :
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>> 0)
>>>>>>>>>>>> 
>>>>>>>>>>>> However, I don't see a reference to this method in
>> patches
>>>>>> 473,
>>>>>>>>>>>> 479,
>>>>>>>>>>>> or
>>>>>>>>>>>> 481. I also don't see a reference to this method in
> the
>>>>>>>>> trunk...
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood
> [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479, 481).
>>>>>>>>>>>> 
>>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again, and it loops
>>>>>>>>>>>> over and over.
>>>>>>>>>>>> 
>>>>>>>>>>>> -------------
>>>>>>>>>>>> Server config
>>>>>>>>>>>> -------------
>>>>>>>>>>>> 
>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>> 
>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>> 
>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
>>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines in
>>>>>>>>>>>> the pods all have a weight of zero, and are not expected to become
>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>> 


Re: Unending Leader Elections in WAN deploy

Posted by Patrick Hunt <ph...@apache.org>.
(I see the same error in fle0weighttest using latest 3.2 btw)

Patrick Hunt wrote:
> Mahadev/Flavio -- looks like 0 weight is still busted, fle0weighttest is 
> actually failing on my machine, however it's reported as success:
> ------------- Standard Error -----------------
> Exception in thread "Thread-108" junit.framework.AssertionFailedError: Elected zero-weight server
>     at junit.framework.Assert.fail(Assert.java:47)
>     at org.apache.zookeeper.test.FLEZeroWeightTest$LEThread.run(FLEZeroWeightTest.java:138)
> ------------- ---------------- ---------------
> 
> this is probably because the test is calling assert in a thread 
> other than the main test thread - which junit will not track/know about.
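
The usual way around this is to capture the failure in the worker thread and re-raise it from the thread the test runner is watching. The following is a minimal sketch in plain Java with no JUnit dependency; WorkerFailureSketch and runAndCapture are hypothetical names and this is not the code in FLEZeroWeightTest.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: run a body in a worker thread and hand any failure
// back to the calling thread instead of losing it on stderr.
class WorkerFailureSketch {
    // Returns whatever the body threw in the worker thread, or null if it completed normally.
    static Throwable runAndCapture(Runnable body) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try {
                body.run();
            } catch (Throwable t) {
                failure.set(t); // remember the failure for the caller
            }
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
        return failure.get();
    }

    public static void main(String[] args) {
        Throwable failure = runAndCapture(() -> {
            throw new AssertionError("Elected zero-weight server");
        });
        // In a real test, the main test thread would now fail with this cause.
        System.out.println("captured: " + (failure == null ? "nothing" : failure.getMessage()));
    }
}
```

In a test method, the calling thread would check the returned Throwable after the join and fail if it is non-null, so the failure is attributed to the test rather than only printed.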
> 
> One problem I see with these tests (0weight test I looked at) -- it 
> doesn't have a client attempt to connect to the various servers as part 
> of declaring success. Really we should only consider a test "success"ful 
> (i.e. assert that) if a client can connect to each server in the cluster 
> and change/see changes. As part of fixing this we really need to do a 
> sanity check by testing the various command lines and checking that a 
> client can connect.
> 
> I'm not even sure FLEnewepochtest/fletest/etc... are passing either. new 
> epoch seems to just thrash...
> 
> Also I tried 3 & 5 server quorums "by hand from the command line" with 0 
> weight and they see similar issues to what Todd is seeing.
> 
> I'm using the latest code in mainline btw.
> 
> Patrick
> 
> Mahadev Konar wrote:
>> Hi todd,
>>   I see a lot of
>> java.net.ConnectException: Connection refused
>>         at sun.nio.ch.Net.connect(Native Method)
>>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>>         at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
>>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
>>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>>
>> Is it possible that there is some firewall? Can all the servers 1-9 
>> connect to all the others using the ports that you specified in zoo.cfg,
>> i.e. 2888/3888?
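
A quick way to test this from each machine is a plain TCP connect with a timeout. The sketch below is an illustration only: PortCheck and canConnect are hypothetical names, the 2000 ms timeout is an arbitrary choice, and a refused connection can also simply mean that nothing is listening on that port at the moment, not necessarily that a firewall is in the way.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: reports whether a TCP connection to host:port can be opened.
class PortCheck {
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host all count as unreachable here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the server hostnames from zoo.cfg as arguments.
        for (String host : args) {
            for (int port : new int[] {2888, 3888}) {
                String state = canConnect(host, port, 2000) ? "reachable" : "unreachable";
                System.out.println(host + ":" + port + " " + state);
            }
        }
    }
}
```

Running it on every server against every other server gives the full connectivity matrix the question asks about.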
>>
>>
>> Thanks
>> mahadev
>>
>>
>> On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>
>>> Looks like we're not getting *any* leader elected now.... Logs attached.
>>>
>>>> -----Original Message-----
>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>> Sent: Tuesday, August 04, 2009 4:07 PM
>>>> To: zookeeper-dev@hadoop.apache.org
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>
>>>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>>>> shortly...
>>>>
>>>>> -----Original Message-----
>>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>
>>>>> Todd, Mahadev and I looked at this and it turns out to be a
>>>> regression.
>>>>> Ironically a patch I created for 3.2 branch to add quorum tests
>>>> actually
>>>>> broke the quorum config -- a default value for a config parameter
>>> was
>>>>> lost. I'm going to submit a patch asap to get the default back, but
>>>> for
>>>>> the time being you can set:
>>>>>
>>>>> electionAlg=3
>>>>>
>>>>> in each of your config files.
>>>>>
>>>>> You should see reference to FastLeaderElection in your log files if
>>>> this
>>>>> parameter is set correctly.
>>>>>
>>>>> Sorry for the trouble,
>>>>>
>>>>> Patrick
>>>>>
>>>>> Todd Greenwood wrote:
>>>>>> Mahadev,
>>>>>>
>>>>>> I just heard from IT that this build behaves in exactly the same
>>> way
>>>> as
>>>>>> previous versions, e.g. we get continuous leader elections that
>>>>>> disconnect the followers and then get re-elected, and
>>>> disconnect...etc.
>>>>>> This is from a fresh sync to the 3.2 branch:
>>>>>>
>>>>>> svn co
>>>>>>
>>> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
>>>>>> ./branch-3.2
>>>>>>
>>>>>> CHANGES.TXT show the various fixes included:
>>>>>>
>>>>>>
>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>> /src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>>>> Release 3.2.1
>>>>>>
>>>>>> Backward compatibile changes:
>>>>>>
>>>>>> BUGFIXES:
>>>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
>>>> via
>>>>>> flavio)
>>>>>>
>>>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
>>>> via
>>>>>> mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
>>> mahadev)
>>>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
>>> via
>>>>>> mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>>>   (giri via mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
>>>> mahadev)
>>>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
>>>> immediate
>>>>>>   failure. (chris via mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
>>>> via
>>>>>> phunt)
>>>>>>
>>>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>>>> other)
>>>>>>   embedded clients (ryan rawson via phunt)
>>>>>>
>>>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
>>>> via
>>>>>> mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups
>>> correctly
>>>>>>   (flavio via mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
>>>> empty
>>>>>> cert
>>>>>>   (Chris Darroch via phunt)
>>>>>>
>>>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>>>> leading and
>>>>>>   add vote of follower (flavio via mahadev)
>>>>>>
>>>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
>>>> (flavio
>>>>>> via
>>>>>>   mahadev)
>>>>>>
>>>>>> What can I do to assist you with this issue?
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>
>>>>>>> Hi todd,
>>>>>>>  comments in line
>>>>>>>
>>>>>>>
>>>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>>>> wrote:
>>>>>>>> Mahadev,
>>>>>>>>
>>>>>>>> Some quick questions:
>>>>>>>>
>>>>>>>> 1. Version
>>>>>>>>
>>>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
>>> is
>>>>>> still
>>>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in
>>>>>> calling
>>>>>>>> this release 3.2.1?
>>>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as
>>> we
>>>>>> tag
>>>>>>> the
>>>>>>> release.
>>>>>>>
>>>>>>>> 2. Build targets
>>>>>>>>
>>>>>>>> The package target fails b/c the create-cppunit-configure target
>>>>>> fails
>>>>>>>> due to various problems w/ respect to autoconf. Are these
>>>>>> dependencies
>>>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>>>>
>>>>>>>> create-cppunit-configure:
>>>>>>>>      [exec] Can't exec "libtoolize": No such file or directory
>>> at
>>>>>>>> /usr/bin/autoreconf line 188.
>>>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern
>>>> match
>>>>>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
>>> not
>>>>>> found
>>>>>>>> in library
>>>>>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>>>>>> AM_PATH_CPPUNIT
>>>>>>>>      [exec]       If this token and others are legitimate,
>>> please
>>>>>> use
>>>>>>>> m4_pattern_allow.
>>>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>>>>>> AC_PROG_LIBTOOL
>>>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
>>> status:
>>>> 1
>>>>>>> You need auto tools to run this. Please read the README for
>>>> building c
>>>>>>> client library at src/c/ for the installation requirements.
>>>>>>>> 3. Sync failure:
>>>>>>>>
>>>>>>>> This is still failing.
>>>>>>>>
>>>>>>>> svn: URL
>>>>>>>>
>>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>>>>>> doesn't exist
>>>>>>>>
>>>>>>> Yes this hasn't been fixed yet!
>>>>>>>
>>>>>>> Thanks
>>>>>>> mahadev
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Todd Greenwood
>>>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>
>>>>>>>>> Great news. Thank you Mahadev. I'll report our findings later
>>>>>> today.
>>>>>>>>> -Todd
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>
>>>>>>>>>> Hi Todd,
>>>>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
>>>>>> now.
>>>>>>>>>> Thanks
>>>>>>>>>> mahadev
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
>>> <to...@audiencescience.com>
>>>>>>>> wrote:
>>>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Todd,
>>>>>>>>>>>>   Most of the patches that you mention should be in the
>>> branch
>>>>>>>> 3.2 by
>>>>>>>>>>> tomm
>>>>>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>>>>>> tomm.
>>>>>>>>>>> Would
>>>>>>>>>>>> that
>>>>>>>>>>>> suffice for you?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> mahadev
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
>>>> <to...@audiencescience.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>> Another problem...I've reverted to the latest versions of
>>> the
>>>>>>>>>>> patches
>>>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>>>>>> compilation
>>>>>>>>>>>>> errors:
>>>>>>>>>>>>>
>>>>>>>>>>>>> build-generated:
>>>>>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>
>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>     [javac]
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-
>>>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>>>>>> getQuorumPeers()
>>>>>>>>>>> have
>>>>>>>>>>>>> the same erasure
>>>>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>>>>>     [javac]                         ^
>>>>>>>>>>>>>     [javac]
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-
>>>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>>>>>> getServerState()
>>>>>>>>>>> have
>>>>>>>>>>>>> the same erasure
>>>>>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>>>>>     [javac]                       ^
>>>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>>>>
>>>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>> I notice that you've updated the patches referenced for
>>> the
>>>>>> WAN
>>>>>>>>>>>>>> deployment. There appears to be an order dependency w/
>>>> respect
>>>>>>>> to
>>>>>>>>>>>>> these
>>>>>>>>>>>>>> four patches...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>>
>>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>>>>>> ical.java
>>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>>
>>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>>
>>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>>>>>> .java
>>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>>
>>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>>>>>
>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could you advise as to which patches I need to apply, and
>>> in
>>>>>>>> what
>>>>>>>>>>>>> order?
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>>>>>> Compilation
>>>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
>>> latest
>>>>>>>> patches
>>>>>>>>>>>>>> 473,
>>>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
>>> of
>>>>>> the
>>>>>>>>>>>>> patch.
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>>>>
>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>>> src/p
>>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>>>
>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>>> src/p
>>>>>>>>>>>>>>
>>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>>> FastL
>>>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>>>>
>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>>   [javac]
>>>>>>>> ^
>>>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see a reference to getWeight in both
>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> patch
>>>>>>>>>>>>>> 491:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>>>> :
>>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>>>> 0)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, I don't see a reference to this method in
>>>> patches
>>>>>>>> 473,
>>>>>>>>>>>>>> 479,
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> 481. I also don't see a reference to this method in
>>> the
>>>>>>>>>>> trunk...
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Todd Greenwood
>>> [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This repro's in both branch-3.2, and
>>>>>>>> branch-3.2+patches(473,
>>>>>>>>>>>>>> 479,
>>>>>>>>>>>>>> 481).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>>>>>>> pd4-zook02
>>>>>>>>>>> to
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>>>>>>> supposed
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
>>> again,
>>>>>>>> and
>>>>>>>>>>>>> it
>>>>>>>>>>>>>> loops
>>>>>>>>>>>>>> over and over.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>> Server config
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>>>>>>> different
>>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>>>>>>> machines
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> dc1
>>>>>>>>>>>>>> have voting rights, and the ability to become a
>>> leader.
>>>>>>>> The
>>>>>>>>>>>>>> machines
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the pods all have a weight of zero, and are not
>>>> expected
>>>>>>>> to
>>>>>>>>>>>>>> become
>>>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Todd
>>

Re: Unending Leader Elections in WAN deploy

Posted by Patrick Hunt <ph...@apache.org>.
Mahadev/Flavio -- looks like 0 weight is still busted. FLEZeroWeightTest is
actually failing on my machine, but it's reported as a success:
------------- Standard Error -----------------
Exception in thread "Thread-108" junit.framework.AssertionFailedError: 
Elected zero-weight server
	at junit.framework.Assert.fail(Assert.java:47)
	at org.apache.zookeeper.test.FLEZeroWeightTest$LEThread.run(FLEZeroWeightTest.java:138)
------------- ---------------- ---------------

This is probably because the test is calling assert in a thread other
than the main test thread, which JUnit will not track or know about.
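
A minimal sketch of one way around this, assuming the worker's failure is
captured and handed back to the thread that runs the test. The names
WorkerAssertionDemo and runAndCapture are hypothetical and are not taken
from the ZooKeeper test code; a real JUnit test would rethrow the captured
Throwable from the test method so the runner records the failure.

```java
import java.util.concurrent.atomic.AtomicReference;

class WorkerAssertionDemo {
    /**
     * Runs the task in a separate thread, waits for it to finish, and
     * returns whatever Throwable it raised (or null if it completed normally).
     * Hypothetical helper, for illustration only.
     */
    static Throwable runAndCapture(Runnable task) throws InterruptedException {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try {
                task.run();
            } catch (Throwable t) {
                // Remember the failure instead of letting it die with the thread.
                failure.set(t);
            }
        });
        worker.start();
        worker.join();
        return failure.get();
    }

    public static void main(String[] args) throws Exception {
        Throwable t = runAndCapture(() -> {
            throw new AssertionError("Elected zero-weight server");
        });
        if (t != null) {
            // In a test method this would be rethrown on the main test thread.
            System.out.println("worker failed: " + t.getMessage());
        }
    }
}
```

The point of the design is only that the assertion error crosses back to
the thread the test runner is watching; how it is rethrown is up to the test.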

One problem I see with these tests (I looked at the 0 weight test) is
that they don't have a client attempt to connect to the various servers
as part of declaring success. Really we should only consider a test
successful (i.e. assert that) if a client can connect to each server in
the cluster and both make changes and see changes. As part of fixing this
we really need to do a sanity check by testing the various command lines
and checking that a client can connect.

I'm not even sure FLENewEpochTest, FLETest, etc. are passing either. The
new epoch test seems to just thrash...

Also I tried 3 and 5 server quorums by hand from the command line with 0
weight, and they show issues similar to what Todd is seeing.

I'm using the latest code in mainline btw.

Patrick

Mahadev Konar wrote:
> Hi todd,
>   I see a lot of 
> 
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.Net.connect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>         at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxMana
> ger.java:324)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.
> java:304)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender
> .process(FastLeaderElection.java:317)
>         at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender
> .run(FastLeaderElection.java:290)
>         at java.lang.Thread.run(Thread.java:619)
> 
> 
> Is it possible that there is some firewall? Can all the servers 1-9 connect
> to all the others using ports that you specified in zoo.cfg i.e 2888/3888?
> 
> 
> Thanks
> mahadev
> 
> 
> On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
>> Looks like we're not getting *any* leader elected now.... Logs attached.
>>
>>> -----Original Message-----
>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>> Sent: Tuesday, August 04, 2009 4:07 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>
>>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>>> shortly...
>>>
>>>> -----Original Message-----
>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>>> To: zookeeper-dev@hadoop.apache.org
>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>
>>>> Todd, Mahadev and I looked at this and it turns out to be a
>>> regression.
>>>> Ironically a patch I created for 3.2 branch to add quorum tests
>>> actually
>>>> broke the quorum config -- a default value for a config parameter
>> was
>>>> lost. I'm going to submit a patch asap to get the default back, but
>>> for
>>>> the time being you can set:
>>>>
>>>> electionAlg=3
>>>>
>>>> in each of your config files.
>>>>
>>>> You should see reference to FastLeaderElection in your log files if
>>> this
>>>> parameter is set correctly.
>>>>
>>>> Sorry for the trouble,
>>>>
>>>> Patrick
>>>>
>>>> Todd Greenwood wrote:
>>>>> Mahadev,
>>>>>
>>>>> I just heard from IT that this build behaves in exactly the same
>> way
>>> as
>>>>> previous versions, e.g. we get continuous leader elections that
>>>>> disconnect the followers and then get re-elected, and
>>> disconnect...etc.
>>>>> This is from a fresh sync to the 3.2 branch:
>>>>>
>>>>> svn co
>>>>>
>> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
>>>>> ./branch-3.2
>>>>>
>>>>> CHANGES.TXT show the various fixes included:
>>>>>
>>>>>
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>> /src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>>> Release 3.2.1
>>>>>
>>>>> Backward compatibile changes:
>>>>>
>>>>> BUGFIXES:
>>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
>>> via
>>>>> flavio)
>>>>>
>>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
>>> via
>>>>> mahadev)
>>>>>
>>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
>> mahadev)
>>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
>> via
>>>>> mahadev)
>>>>>
>>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>>   (giri via mahadev)
>>>>>
>>>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
>>> mahadev)
>>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
>>> immediate
>>>>>   failure. (chris via mahadev)
>>>>>
>>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
>>> via
>>>>> phunt)
>>>>>
>>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>>> other)
>>>>>   embedded clients (ryan rawson via phunt)
>>>>>
>>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
>>> via
>>>>> mahadev)
>>>>>
>>>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups
>> correctly
>>>>>   (flavio via mahadev)
>>>>>
>>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
>>> empty
>>>>> cert
>>>>>   (Chris Darroch via phunt)
>>>>>
>>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>>> leading and
>>>>>   add vote of follower (flavio via mahadev)
>>>>>
>>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
>>> (flavio
>>>>> via
>>>>>   mahadev)
>>>>>
>>>>> What can I do to assist you with this issue?
>>>>>
>>>>> -Todd
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>
>>>>>> Hi todd,
>>>>>>  comments in line
>>>>>>
>>>>>>
>>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>>> wrote:
>>>>>>> Mahadev,
>>>>>>>
>>>>>>> Some quick questions:
>>>>>>>
>>>>>>> 1. Version
>>>>>>>
>>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
>> is
>>>>> still
>>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in
>>>>> calling
>>>>>>> this release 3.2.1?
>>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as
>> we
>>>>> tag
>>>>>> the
>>>>>> release.
>>>>>>
>>>>>>> 2. Build targets
>>>>>>>
>>>>>>> The package target fails b/c the create-cppunit-configure target
>>>>> fails
>>>>>>> due to various problems w/ respect to autoconf. Are these
>>>>> dependencies
>>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>>>
>>>>>>> create-cppunit-configure:
>>>>>>>      [exec] Can't exec "libtoolize": No such file or directory
>> at
>>>>>>> /usr/bin/autoreconf line 188.
>>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern
>>> match
>>>>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
>> not
>>>>> found
>>>>>>> in library
>>>>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>>>>> AM_PATH_CPPUNIT
>>>>>>>      [exec]       If this token and others are legitimate,
>> please
>>>>> use
>>>>>>> m4_pattern_allow.
>>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>>>>> AC_PROG_LIBTOOL
>>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
>> status:
>>> 1
>>>>>> You need auto tools to run this. Please read the README for
>>> building c
>>>>>> client library at src/c/ for the installation requirements.
>>>>>>> 3. Sync failure:
>>>>>>>
>>>>>>> This is still failing.
>>>>>>>
>>>>>>> svn: URL
>>>>>>>
>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>>>>> doesn't exist
>>>>>>>
>>>>>> Yes this hasn't been fixed yet!
>>>>>>
>>>>>> Thanks
>>>>>> mahadev
>>>>>>> -Todd
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Todd Greenwood
>>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>
>>>>>>>> Great news. Thank you Mahadev. I'll report our findings later
>>>>> today.
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>
>>>>>>>>> Hi Todd,
>>>>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
>>>>> now.
>>>>>>>>> Thanks
>>>>>>>>> mahadev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
>> <to...@audiencescience.com>
>>>>>>> wrote:
>>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>
>>>>>>>>>>> Hi Todd,
>>>>>>>>>>>   Most of the patches that you mention should be in the
>> branch
>>>>>>> 3.2 by
>>>>>>>>>> tomm
>>>>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>>>>> tomm.
>>>>>>>>>> Would
>>>>>>>>>>> that
>>>>>>>>>>> suffice for you?
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> mahadev
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
>>> <to...@audiencescience.com>
>>>>>>>> wrote:
>>>>>>>>>>>> Another problem...I've reverted to the latest versions of
>> the
>>>>>>>>>> patches
>>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>>>>> compilation
>>>>>>>>>>>> errors:
>>>>>>>>>>>>
>>>>>>>>>>>> build-generated:
>>>>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>     [javac]
>>>>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>> atched/branch-
>>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>>>>> getQuorumPeers()
>>>>>>>>>> have
>>>>>>>>>>>> the same erasure
>>>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>>>>     [javac]                         ^
>>>>>>>>>>>>     [javac]
>>>>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>> atched/branch-
>>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>>>>> getServerState()
>>>>>>>>>> have
>>>>>>>>>>>> the same erasure
>>>>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>>>>     [javac]                       ^
>>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>>>
>>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>>>
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>> I notice that you've updated the patches referenced for
>> the
>>>>> WAN
>>>>>>>>>>>>> deployment. There appears to be an order dependency w/
>>> respect
>>>>>>> to
>>>>>>>>>>>> these
>>>>>>>>>>>>> four patches...
>>>>>>>>>>>>>
>>>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>>>>> ical.java
>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>>>>> .java
>>>>>>>>>>>>> patching file
>>>>>>>>>>>>>
>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>>>>
>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could you advise as to which patches I need to apply, and
>> in
>>>>>>> what
>>>>>>>>>>>> order?
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>>>>> Compilation
>>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
>> latest
>>>>>>> patches
>>>>>>>>>>>>> 473,
>>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
>> of
>>>>> the
>>>>>>>>>>>> patch.
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>> src/p
>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>>
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>> src/p
>>>>>>>>>>>>>
>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>> FastL
>>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>>>
>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>   [javac]
>>>>>>> ^
>>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see a reference to getWeight in both
>>>>>>> FastLeaderElection.java
>>>>>>>>>>>> in
>>>>>>>>>>>>> patch
>>>>>>>>>>>>> 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>>> :
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>>> 0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I don't see a reference to this method in
>>> patches
>>>>>>> 473,
>>>>>>>>>>>>> 479,
>>>>>>>>>>>>> or
>>>>>>>>>>>>> 481. I also don't see a reference to this method in
>> the
>>>>>>>>>> trunk...
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Todd Greenwood
>> [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> This repro's in both branch-3.2, and
>>>>>>> branch-3.2+patches(473,
>>>>>>>>>>>>> 479,
>>>>>>>>>>>>> 481).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>>>>>> pd4-zook02
>>>>>>>>>> to
>>>>>>>>>>>>> be
>>>>>>>>>>>>> the
>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>>>>>> supposed
>>>>>>>>>>>>> to
>>>>>>>>>>>>> be
>>>>>>>>>>>>> and
>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
>> again,
>>>>>>> and
>>>>>>>>>>>> it
>>>>>>>>>>>>> loops
>>>>>>>>>>>>> over and over.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>> Server config
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>
>>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>>
>>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>>>>>> different
>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>>>>>> machines
>>>>>>>>>>>>> in
>>>>>>>>>>>>> dc1
>>>>>>>>>>>>> have voting rights, and the ability to become a
>> leader.
>>>>>>> The
>>>>>>>>>>>>> machines
>>>>>>>>>>>>> in
>>>>>>>>>>>>> the pods all have a weight of zero, and are not
>>> expected
>>>>>>> to
>>>>>>>>>>>>> become
>>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
> 

Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
  I see a lot of these:

java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:324)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:304)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:317)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:290)
        at java.lang.Thread.run(Thread.java:619)


Is it possible that there is a firewall in the way? Can all of the servers
1-9 connect to all the others using the ports that you specified in
zoo.cfg, i.e. 2888 and 3888?
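
One rough way to check this from each machine is a plain TCP connect
against every peer and port. The sketch below is only an illustration: the
class QuorumPortCheck and the method canConnect are hypothetical names, the
2-second timeout is an arbitrary assumption, and the host names are expected
to be passed on the command line. Note that, as far as I understand, a server
that is not currently leading may refuse connections on the first port (2888)
even with no firewall, so a refusal on the election port (3888) is the more
telling sign.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class QuorumPortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, unknown host, etc.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical usage: pass the peer host names from zoo.cfg as arguments.
        int[] ports = {2888, 3888};
        for (String host : args) {
            for (int port : ports) {
                String result = canConnect(host, port, 2000) ? "ok" : "FAILED";
                System.out.println(host + ":" + port + " -> " + result);
            }
        }
    }
}
```

Running it on every server against every other server gives the full
connectivity picture rather than just one direction.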


Thanks
mahadev


On 8/4/09 4:56 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Looks like we're not getting *any* leader elected now.... Logs attached.
> 
>> -----Original Message-----
>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>> Sent: Tuesday, August 04, 2009 4:07 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>> 
>> Patrick, thanks! I'll forward on to IT and I'll report back to you
>> shortly...
>> 
>>> -----Original Message-----
>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>> Sent: Tuesday, August 04, 2009 3:55 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>> 
>>> Todd, Mahadev and I looked at this and it turns out to be a
>> regression.
>>> Ironically a patch I created for 3.2 branch to add quorum tests
>> actually
>>> broke the quorum config -- a default value for a config parameter
> was
>>> lost. I'm going to submit a patch asap to get the default back, but
>> for
>>> the time being you can set:
>>> 
>>> electionAlg=3
>>> 
>>> in each of your config files.
>>> 
>>> You should see reference to FastLeaderElection in your log files if
>> this
>>> parameter is set correctly.
>>> 
>>> Sorry for the trouble,
>>> 
>>> Patrick
>>> 
>>> Todd Greenwood wrote:
>>>> Mahadev,
>>>> 
>>>> I just heard from IT that this build behaves in exactly the same
> way
>> as
>>>> previous versions, e.g. we get continuous leader elections that
>>>> disconnect the followers and then get re-elected, and
>> disconnect...etc.
>>>> 
>>>> This is from a fresh sync to the 3.2 branch:
>>>> 
>>>> svn co
>>>> 
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
>>>> ./branch-3.2
>>>> 
>>>> CHANGES.TXT show the various fixes included:
>>>> 
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>> /src/original$ head -n 50 branch-3.2/CHANGES.txt
>>>> Release 3.2.1
>>>> 
>>>> Backward compatibile changes:
>>>> 
>>>> BUGFIXES:
>>>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
>> via
>>>> flavio)
>>>> 
>>>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
>> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
> mahadev)
>>>> 
>>>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>>>   (giri via mahadev)
>>>> 
>>>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
>> mahadev)
>>>> 
>>>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
>> immediate
>>>>   failure. (chris via mahadev)
>>>> 
>>>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
>> via
>>>> phunt)
>>>> 
>>>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
>>>> other)
>>>>   embedded clients (ryan rawson via phunt)
>>>> 
>>>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
>> via
>>>> mahadev)
>>>> 
>>>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups
> correctly
>>>>   (flavio via mahadev)
>>>> 
>>>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
>> empty
>>>> cert
>>>>   (Chris Darroch via phunt)
>>>> 
>>>>   ZOOKEEPER-480. FLE should perform leader check when node is not
>>>> leading and
>>>>   add vote of follower (flavio via mahadev)
>>>> 
>>>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected
>> (flavio
>>>> via
>>>>   mahadev)
>>>> 
>>>> What can I do to assist you with this issue?
>>>> 
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>>>> To: zookeeper-dev@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Hi todd,
>>>>>  comments in line
>>>>> 
>>>>> 
>>>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>> Mahadev,
>>>>>> 
>>>>>> Some quick questions:
>>>>>> 
>>>>>> 1. Version
>>>>>> 
>>>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
> is
>>>> still
>>>>>> calling this 3.2.0. Should this be rev'd, and am I correct in
>>>> calling
>>>>>> this release 3.2.1?
>>>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as
> we
>>>> tag
>>>>> the
>>>>> release.
>>>>> 
>>>>>> 2. Build targets
>>>>>> 
>>>>>> The package target fails b/c the create-cppunit-configure target
>>>> fails
>>>>>> due to various problems w/ respect to autoconf. Are these
>>>> dependencies
>>>>>> documented somewhere ? I'd like to have a fully building system.
>>>>>> 
>>>>>> create-cppunit-configure:
>>>>>>      [exec] Can't exec "libtoolize": No such file or directory
> at
>>>>>> /usr/bin/autoreconf line 188.
>>>>>>      [exec] Use of uninitialized value $libtoolize in pattern
>> match
>>>>>> (m//) at /usr/bin/autoreconf line 188.
>>>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
> not
>>>> found
>>>>>> in library
>>>>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>>>>> AM_PATH_CPPUNIT
>>>>>>      [exec]       If this token and others are legitimate,
> please
>>>> use
>>>>>> m4_pattern_allow.
>>>>>>      [exec]       See the Autoconf documentation.
>>>>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>>>>> AC_PROG_LIBTOOL
>>>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
> status:
>> 1
>>>>>> 
>>>>> You need autotools to run this. Please read the README for
>> building c
>>>>> client library at src/c/ for the installation requirements.
>>>>>> 3. Sync failure:
>>>>>> 
>>>>>> This is still failing.
>>>>>> 
>>>>>> svn: URL
>>>>>> 
> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>>>>> doesn't exist
>>>>>> 
>>>>> Yes this hasn't been fixed yet!
>>>>> 
>>>>> Thanks
>>>>> mahadev
>>>>>> -Todd
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Todd Greenwood
>>>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Great news. Thank you Mahadev. I'll report our findings later
>>>> today.
>>>>>>> -Todd
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>> 
>>>>>>>> Hi Todd,
>>>>>>>>  I just committed 480 and 491. You can check out the 3.2 branch
>>>> now.
>>>>>>>> Thanks
>>>>>>>> mahadev
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
> <to...@audiencescience.com>
>>>>>> wrote:
>>>>>>>>> That'd be perfect. Thanks!
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> Hi Todd,
>>>>>>>>>>   Most of the patches that you mention should be in the
> branch
>>>>>> 3.2 by
>>>>>>>>> tomm
>>>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>>>> tomm.
>>>>>>>>> Would
>>>>>>>>>> that
>>>>>>>>>> suffice for you?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> mahadev
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
>> <to...@audiencescience.com>
>>>>>>> wrote:
>>>>>>>>>>> Another problem...I've reverted to the latest versions of
> the
>>>>>>>>> patches
>>>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>>>> compilation
>>>>>>>>>>> errors:
>>>>>>>>>>> 
>>>>>>>>>>> build-generated:
>>>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>> 
>>>>>>>>>>> compile-main:
>>>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>     [javac]
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-
>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>>>> getQuorumPeers()
>>>>>>>>> have
>>>>>>>>>>> the same erasure
>>>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>>>     [javac]                         ^
>>>>>>>>>>>     [javac]
>>>>>>>>>>> 
>>>> 
>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-
>>>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>>>> getServerState()
>>>>>>>>> have
>>>>>>>>>>> the same erasure
>>>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>>>     [javac]                       ^
>>>>>>>>>>>     [javac] 2 errors
>>>>>>>>>>> 
>>>>>>>>>>> My build process is pretty simple:
>>>>>>>>>>> 
>>>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> I notice that you've updated the patches referenced for
> the
>>>> WAN
>>>>>>>>>>>> deployment. There appears to be an order dependency w/
>> respect
>>>>>> to
>>>>>>>>>>> these
>>>>>>>>>>>> four patches...
>>>>>>>>>>>> 
>>>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>>> 
>>>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>>>> ical.java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>>>> .java
>>>>>>>>>>>> patching file
>>>>>>>>>>>> 
>>>>>> 
>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>>> 
>>>> 
>> 
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>>> 
>>>>>>>>>>>> Could you advise as to which patches I need to apply, and
> in
>>>>>> what
>>>>>>>>>>> order?
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>>>> Compilation
>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
> latest
>>>>>> patches
>>>>>>>>>>>> 473,
>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
> of
>>>> the
>>>>>>>>>>> patch.
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>> 
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> 
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> 
>>>>>> 
>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastL
>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>> 
>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>   [javac]
>>>>>> ^
>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>> 
>>>>>>>>>>>> I see a reference to getWeight in both
>>>>>> FastLeaderElection.java
>>>>>>>>>>> in
>>>>>>>>>>>> patch
>>>>>>>>>>>> 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>> :
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>> 0)
>>>>>>>>>>>> 
>>>>>>>>>>>> However, I don't see a reference to this method in
>> patches
>>>>>> 473,
>>>>>>>>>>>> 479,
>>>>>>>>>>>> or
>>>>>>>>>>>> 481. I also don't see a reference to this method in
> the
>>>>>>>>> trunk...
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood
> [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This repro's in both branch-3.2, and
>>>>>> branch-3.2+patches(473,
>>>>>>>>>>>> 479,
>>>>>>>>>>>> 481).
>>>>>>>>>>>> 
>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>>>>> pd4-zook02
>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> the
>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>>>>> supposed
>>>>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> and
>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
> again,
>>>>>> and
>>>>>>>>>>> it
>>>>>>>>>>>> loops
>>>>>>>>>>>> over and over.
>>>>>>>>>>>> 
>>>>>>>>>>>> -------------
>>>>>>>>>>>> Server config
>>>>>>>>>>>> -------------
>>>>>>>>>>>> 
>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>> 
>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>> 
>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>>>>> different
>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>>>>> machines
>>>>>>>>>>>> in
>>>>>>>>>>>> dc1
>>>>>>>>>>>> have voting rights, and the ability to become a
> leader.
>>>>>> The
>>>>>>>>>>>> machines
>>>>>>>>>>>> in
>>>>>>>>>>>> the pods all have a weight of zero, and are not
>> expected
>>>>>> to
>>>>>>>>>>>> become
>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>> 
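
For readers following the server config quoted above, here is a minimal
sketch in Python of how a grouped, weighted quorum check could treat those
group.N and weight.N settings. It is not ZooKeeper's QuorumHierarchical
code. The names GROUPS, WEIGHTS, contains_quorum and eligible_leaders are
made up for illustration, and the rule used (more than half of a group's
weight, in more than half of the groups that carry any weight) is an
assumption based on the intent described in the thread.

```python
# Illustrative sketch only; names and rule are assumptions, not ZooKeeper code.

# Mirrors the group.N / weight.N lines quoted in the thread.
GROUPS = {1: [1, 2, 3, 4, 5], 2: [6, 7, 8, 9]}
WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}


def contains_quorum(acks, groups=GROUPS, weights=WEIGHTS):
    """Return True if the server ids in `acks` form a quorum.

    Assumed rule: a group is satisfied when the acknowledging servers hold
    more than half of that group's total weight, and a quorum needs more
    than half of the groups whose total weight is non-zero.
    """
    group_total = {
        gid: sum(weights[sid] for sid in members)
        for gid, members in groups.items()
    }
    # Groups made up entirely of zero-weight servers never count.
    voting_groups = [gid for gid, total in group_total.items() if total > 0]

    satisfied = 0
    for gid in voting_groups:
        acked = sum(weights[sid] for sid in groups[gid] if sid in acks)
        if acked > group_total[gid] / 2:
            satisfied += 1
    return satisfied > len(voting_groups) / 2


def eligible_leaders(weights=WEIGHTS):
    """Server ids with non-zero weight, i.e. the only intended leaders."""
    return sorted(sid for sid, weight in weights.items() if weight != 0)
```

Under these assumptions, contains_quorum({1, 2, 3}) is True, while
contains_quorum({1, 2, 6, 7, 8, 9}) is False, because the four pod servers
add no weight. Server 9 (pd4-zook02) does not appear in eligible_leaders(),
which matches the expectation that it should never be elected.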


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Looks like we're not getting *any* leader elected now... Logs attached.

> -----Original Message-----
> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> Sent: Tuesday, August 04, 2009 4:07 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: RE: Unending Leader Elections in WAN deploy
> 
> Patrick, thanks! I'll forward on to IT and I'll report back to you
> shortly...
> 
> > -----Original Message-----
> > From: Patrick Hunt [mailto:phunt@apache.org]
> > Sent: Tuesday, August 04, 2009 3:55 PM
> > To: zookeeper-dev@hadoop.apache.org
> > Subject: Re: Unending Leader Elections in WAN deploy
> >
> > Todd, Mahadev and I looked at this and it turns out to be a
> regression.
> > Ironically a patch I created for 3.2 branch to add quorum tests
> actually
> > broke the quorum config -- a default value for a config parameter
was
> > lost. I'm going to submit a patch asap to get the default back, but
> for
> > the time being you can set:
> >
> > electionAlg=3
> >
> > in each of your config files.
> >
> > You should see reference to FastLeaderElection in your log files if
> this
> > parameter is set correctly.
> >
> > Sorry for the trouble,
> >
> > Patrick
> >
> > Todd Greenwood wrote:
> > > Mahadev,
> > >
> > > I just heard from IT that this build behaves in exactly the same
way
> as
> > > previous versions, e.g. we get continuous leader elections that
> > > disconnect the followers and then get re-elected, and
> disconnect...etc.
> > >
> > > This is from a fresh sync to the 3.2 branch:
> > >
> > > svn co
> > >
http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
> > > ./branch-3.2
> > >
> > > CHANGES.txt shows the various fixes included:
> > >
> > >
>
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> > > /src/original$ head -n 50 branch-3.2/CHANGES.txt
> > > Release 3.2.1
> > >
> > > Backward compatibile changes:
> > >
> > > BUGFIXES:
> > >   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris
> via
> > > flavio)
> > >
> > >   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris
> via
> > > mahadev)
> > >
> > >   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via
mahadev)
> > >
> > >   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris
via
> > > mahadev)
> > >
> > >   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
> > >   (giri via mahadev)
> > >
> > >   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via
> mahadev)
> > >
> > >   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent
> immediate
> > >   failure. (chris via mahadev)
> > >
> > >   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev
> via
> > > phunt)
> > >
> > >   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
> > > other)
> > >   embedded clients (ryan rawson via phunt)
> > >
> > >   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio
> via
> > > mahadev)
> > >
> > >   ZOOKEEPER-479.  QuorumHierarchical does not count groups
correctly
> > >   (flavio via mahadev)
> > >
> > >   ZOOKEEPER-466. crash on zookeeper_close() when using auth with
> empty
> > > cert
> > >   (Chris Darroch via phunt)
> > >
> > >   ZOOKEEPER-480. FLE should perform leader check when node is not
> > > leading and
> > >   add vote of follower (flavio via mahadev)
> > >
> > >   ZOOKEEPER-491. Prevent zero-weight servers from being elected
> (flavio
> > > via
> > >   mahadev)
> > >
> > > What can I do to assist you with this issue?
> > >
> > > -Todd
> > >
> > >> -----Original Message-----
> > >> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> > >> Sent: Tuesday, August 04, 2009 12:43 PM
> > >> To: zookeeper-dev@hadoop.apache.org
> > >> Subject: Re: Unending Leader Elections in WAN deploy
> > >>
> > >> Hi todd,
> > >>  comments in line
> > >>
> > >>
> > >> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com>
> > > wrote:
> > >>> Mahadev,
> > >>>
> > >>> Some quick questions:
> > >>>
> > >>> 1. Version
> > >>>
> > >>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml
is
> > > still
> > >>> calling this 3.2.0. Should this be rev'd, and am I correct in
> > > calling
> > >>> this release 3.2.1?
> > >> Yes the release is 3.2.1. The build.xml will be fixed as soon as
we
> > > tag
> > >> the
> > >> release.
> > >>
> > >>> 2. Build targets
> > >>>
> > >>> The package target fails b/c the create-cppunit-configure target
> > > fails
> > >>> due to various problems w/ respect to autoconf. Are these
> > > dependencies
> > >>> documented somewhere ? I'd like to have a fully building system.
> > >>>
> > >>> create-cppunit-configure:
> > >>>      [exec] Can't exec "libtoolize": No such file or directory
at
> > >>> /usr/bin/autoreconf line 188.
> > >>>      [exec] Use of uninitialized value $libtoolize in pattern
> match
> > >>> (m//) at /usr/bin/autoreconf line 188.
> > >>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT'
not
> > > found
> > >>> in library
> > >>>      [exec] configure.ac:33: error: possibly undefined macro:
> > >>> AM_PATH_CPPUNIT
> > >>>      [exec]       If this token and others are legitimate,
please
> > > use
> > >>> m4_pattern_allow.
> > >>>      [exec]       See the Autoconf documentation.
> > >>>      [exec] configure.ac:53: error: possibly undefined macro:
> > >>> AC_PROG_LIBTOOL
> > >>>      [exec] autoreconf: /usr/bin/autoconf failed with exit
status:
> 1
> > >>>
> > >> You need autotools to run this. Please read the README for
> building c
> > >> client library at src/c/ for the installation requirements.
> > >>> 3. Sync failure:
> > >>>
> > >>> This is still failing.
> > >>>
> > >>> svn: URL
> > >>>
'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> > >>> doesn't exist
> > >>>
> > >> Yes this hasn't been fixed yet!
> > >>
> > >> Thanks
> > >> mahadev
> > >>> -Todd
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Todd Greenwood
> > >>>> Sent: Tuesday, August 04, 2009 11:26 AM
> > >>>> To: 'zookeeper-user@hadoop.apache.org'
> > >>>> Subject: RE: Unending Leader Elections in WAN deploy
> > >>>>
> > >>>> Great news. Thank you Mahadev. I'll report our findings later
> > > today.
> > >>>> -Todd
> > >>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> > >>>>> Sent: Tuesday, August 04, 2009 11:20 AM
> > >>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>
> > >>>>> Hi Todd,
> > >>>>>  I just committed 480 and 491. You can check out the 3.2 branch
> > > now.
> > >>>>> Thanks
> > >>>>> mahadev
> > >>>>>
> > >>>>>
> > >>>>> On 8/3/09 4:29 PM, "Todd Greenwood"
<to...@audiencescience.com>
> > >>> wrote:
> > >>>>>> That'd be perfect. Thanks!
> > >>>>>>
> > >>>>>>> -----Original Message-----
> > >>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> > >>>>>>> Sent: Monday, August 03, 2009 4:24 PM
> > >>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>>>
> > >>>>>>> Hi Todd,
> > >>>>>>>   Most of the patches that you mention should be in the
branch
> > >>> 3.2 by
> > >>>>>> tomm
> > >>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
> > > tomm.
> > >>>>>> Would
> > >>>>>>> that
> > >>>>>>> suffice for you?
> > >>>>>>>
> > >>>>>>> Thanks
> > >>>>>>> mahadev
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
> <to...@audiencescience.com>
> > >>>> wrote:
> > >>>>>>>> Another problem...I've reverted to the latest versions of
the
> > >>>>>> patches
> > >>>>>>>> that are not specific to branch-3.2, and I'm getting two
> > >>> compilation
> > >>>>>>>> errors:
> > >>>>>>>>
> > >>>>>>>> build-generated:
> > >>>>>>>>     [javac] Compiling 44 source files to
> > >>>>>>>>
> > >
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > >>>>>>>> atched/branch-3.2/build/classes
> > >>>>>>>>
> > >>>>>>>> compile-main:
> > >>>>>>>>     [javac] Compiling 2 source files to
> > >>>>>>>>
> > >
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > >>>>>>>> atched/branch-3.2/build/classes
> > >>>>>>>>     [javac]
> > >>>>>>>>
> > >
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > >>>>>> atched/branch-
> > >>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> > >>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
> > >>> getQuorumPeers()
> > >>>>>> have
> > >>>>>>>> the same erasure
> > >>>>>>>>     [javac]         public String[] getQuorumPeers();
> > >>>>>>>>     [javac]                         ^
> > >>>>>>>>     [javac]
> > >>>>>>>>
> > >
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > >>>>>> atched/branch-
> > >>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> > >>>>>>>> mStats.java:31: name clash: getServerState() and
> > >>> getServerState()
> > >>>>>> have
> > >>>>>>>> the same erasure
> > >>>>>>>>     [javac]         public String getServerState();
> > >>>>>>>>     [javac]                       ^
> > >>>>>>>>     [javac] 2 errors
> > >>>>>>>>
> > >>>>>>>> My build process is pretty simple:
> > >>>>>>>>
> > >>>>>>>> 1. copy the branch-3.2 source to a temp directory
> > >>>>>>>> (src/patched/branch-3.2)
> > >>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> > >>>>>>>> 3. build zookeeper in the temp directory
> > >>>>>>>>
> > >>>>>>>> -Todd
> > >>>>>>>>> -----Original Message-----
> > >>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> > >>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> > >>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> > >>>>>>>>>
> > >>>>>>>>> Flavio,
> > >>>>>>>>> I notice that you've updated the patches referenced for
the
> > > WAN
> > >>>>>>>>> deployment. There appears to be an order dependency w/
> respect
> > >>> to
> > >>>>>>>> these
> > >>>>>>>>> four patches...
> > >>>>>>>>>
> > >>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> > >>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> > >>>>>>>>>
> > >>>>>>>>> 473 -> 479 (479 fails)
> > >>>>>>>>>
> > >>>>>>>>>
> > >
>
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> > >>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
> > >>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
> > >>>>>>>>> patching file
> > >>>>>>>>>
> > >
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
> > >>>>>>>>> ical.java
> > >>>>>>>>> patching file
> > >>>>>>>>>
> > >
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> > >>>>>>>>> patching file
> > >>>>>>>>>
> > >
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
> > >>>>>>>>> .java
> > >>>>>>>>> patching file
> > >>>>>>>>>
> > >>>
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> > >>>>>>>>> Hunk #1 FAILED at 93.
> > >>>>>>>>> Hunk #2 FAILED at 145.
> > >>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
> > >>>>>>>>>
> > >
>
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> > >
>
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> > >>>>>>>>> /src/patched/branch-3.2$ h ../patches/
> > >>>>>>>>>
> > >>>>>>>>> Could you advise as to which patches I need to apply, and
in
> > >>> what
> > >>>>>>>> order?
> > >>>>>>>>> -Todd
> > >>>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> > >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>>>>>>
> > >>>>>>>>>> Perfect! Thanks for the update, Todd.
> > >>>>>>>>>>
> > >>>>>>>>>> -Flavio
> > >>>>>>>>>>
> > >>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
> > >>> Compilation
> > >>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the
latest
> > >>> patches
> > >>>>>>>>> 473,
> > >>>>>>>>>>> 479, 481, and 491.
> > >>>>>>>>>>>
> > >>>>>>>>>>> -Todd
> > >>>>>>>>>>>
> > >>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> > >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> It should be in 479. Perhaps you have a stale version
of
> > > the
> > >>>>>>>> patch.
> > >>>>>>>>>>>> -Flavio
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Flavio,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'm getting a compilation error for patch 491:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> compile-main:
> > >>>>>>>>>>>>>   [javac] Compiling 1 source file to
> > >>>>>>>>>>>>>
> > >>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > >>>>>>>>>>>>> src/p
> > >>>>>>>>>>>>> atched/branch-3.2/build/classes
> > >>>>>>>>>>>>>   [javac]
> > >>>>>>>>>>>>>
> > >>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > >>>>>>>>>>>>> src/p
> > >>>>>>>>>>>>>
> > >>>
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> > >>>>>>>>>>>>> FastL
> > >>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
> > >>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> > >>>>>>>>>>>>>   [javac] location: interface
> > >>>>>>>>>>>>>
> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> > >>>>>>>>>>>>>   [javac]
> > >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>>>>>>>>>>>>   [javac]
> > >>> ^
> > >>>>>>>>>>>>>   [javac] 1 error
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I see a reference to getWeight in both
> > >>> FastLeaderElection.java
> > >>>>>>>> in
> > >>>>>>>>>>>>> patch
> > >>>>>>>>>>>>> 491:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
> > >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
> > >>>>>>>>>>>>> FastLeaderElection.java
> > >>>>>>>>>>>>> :
> > >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> > >>>>>>>>>>>>> 0)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> However, I don't see a reference to this method in
> patches
> > >>> 473,
> > >>>>>>>>> 479,
> > >>>>>>>>>>>>> or
> > >>>>>>>>>>>>> 481. I also don't see a reference to this method in
the
> > >>>>>> trunk...
> > >>>>>>>>>>>>> -Todd
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>> From: Todd Greenwood
[mailto:toddg@audiencescience.com]
> > >>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> > >>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
> > >>>>>>>>>>>>>> -Todd
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> > >>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> You're missing 491 from your set of patches.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -Flavio
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> This repro's in both branch-3.2, and
> > >>> branch-3.2+patches(473,
> > >>>>>>>>> 479,
> > >>>>>>>>>>>>>> 481).
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Basically, it seems like the nodes are electing
> > >>> pd4-zook02
> > >>>>>> to
> > >>>>>>>>> be
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> > >>>>>>>> supposed
> > >>>>>>>>> to
> > >>>>>>>>>>>>> be
> > >>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it
again,
> > >>> and
> > >>>>>>>> it
> > >>>>>>>>>>>>> loops
> > >>>>>>>>>>>>>> over and over.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -------------
> > >>>>>>>>>>>>>> Server config
> > >>>>>>>>>>>>>> -------------
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> > >>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> > >>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> group.1:1:2:3:4:5
> > >>>>>>>>>>>>>> weight.1=1
> > >>>>>>>>>>>>>> weight.2=1
> > >>>>>>>>>>>>>> weight.3=1
> > >>>>>>>>>>>>>> weight.4=1
> > >>>>>>>>>>>>>> weight.5=1
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> group.2:6:7:8:9
> > >>>>>>>>>>>>>> weight.6=0
> > >>>>>>>>>>>>>> weight.7=0
> > >>>>>>>>>>>>>> weight.8=0
> > >>>>>>>>>>>>>> weight.9=0
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3 different
> > >>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> > >>>>>>>>>>>>>> have voting rights, and the ability to become a leader. The machines in
> > >>>>>>>>>>>>>> the pods all have a weight of zero, and are not expected to become
> > >>>>>>>>>>>>>> leaders, or to vote on transactions.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -Todd
> > >

RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick, thanks! I'll forward on to IT and I'll report back to you
shortly...

> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Tuesday, August 04, 2009 3:55 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Todd, Mahadev and I looked at this and it turns out to be a regression.
> Ironically a patch I created for 3.2 branch to add quorum tests actually
> broke the quorum config -- a default value for a config parameter was
> lost. I'm going to submit a patch asap to get the default back, but for
> the time being you can set:
> 
> electionAlg=3
> 
> in each of your config files.
> 
> You should see reference to FastLeaderElection in your log files if this
> parameter is set correctly.
> 
> Sorry for the trouble,
> 
> Patrick
> 

Re: Unending Leader Elections in WAN deploy

Posted by Patrick Hunt <ph...@apache.org>.
Todd, Mahadev and I looked at this and it turns out to be a regression. 
Ironically a patch I created for 3.2 branch to add quorum tests actually 
broke the quorum config -- a default value for a config parameter was 
lost. I'm going to submit a patch asap to get the default back, but for 
the time being you can set:

electionAlg=3

in each of your config files.

You should see reference to FastLeaderElection in your log files if this 
parameter is set correctly.
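
The check described above (the setting is present in the config, and the logs mention FastLeaderElection) can be sketched as a small script. This is only an illustration, not part of ZooKeeper: the helper names `election_alg` and `uses_fast_leader_election` are made up here, the sample config is abbreviated, and the sample log line is a placeholder rather than real server output.

```python
def election_alg(cfg_text):
    """Return the electionAlg value from zoo.cfg-style text, or None if unset."""
    value = None
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "electionAlg":
            value = val.strip()  # a later assignment overrides an earlier one
    return value


def uses_fast_leader_election(log_text):
    """True if the log text mentions FastLeaderElection anywhere."""
    return "FastLeaderElection" in log_text


# Hypothetical, abbreviated inputs for illustration only.
sample_cfg = """\
# workaround until the default is restored
electionAlg=3
server.1=dc1-zook01.dc01.revsci.net:2888:3888
"""
sample_log = "placeholder log line that mentions FastLeaderElection"

if __name__ == "__main__":
    print("electionAlg set to 3:", election_alg(sample_cfg) == "3")
    print("FastLeaderElection in log:", uses_fast_leader_election(sample_log))
```

To use it against a real deployment you would read each server's config file and log file into strings and pass them to the two helpers; the file locations depend on your installation.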

Sorry for the trouble,

Patrick

Todd Greenwood wrote:
> Mahadev,
> 
> I just heard from IT that this build behaves in exactly the same way as
> previous versions, e.g. we get continuous leader elections that
> disconnect the followers and then get re-elected, and disconnect...etc.
> 
> This is from a fresh sync to the 3.2 branch:
> 
> svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
> 
> CHANGES.txt shows the various fixes included:
> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> Release 3.2.1
> 
> Backward compatibile changes:
> 
> BUGFIXES:
>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
> 
>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
> 
>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> 
>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
> 
>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>   (giri via mahadev)
> 
>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> 
>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
>   failure. (chris via mahadev)
> 
>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
> 
>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
>   embedded clients (ryan rawson via phunt)
> 
>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
> 
>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
>   (flavio via mahadev)
> 
>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
>   (Chris Darroch via phunt)
> 
>   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
>   add vote of follower (flavio via mahadev)
> 
>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
>   mahadev)
> 
> What can I do to assist you with this issue?
> 
> -Todd
> 
>> -----Original Message-----
>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>> Sent: Tuesday, August 04, 2009 12:43 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: Re: Unending Leader Elections in WAN deploy
>>
>> Hi todd,
>>  comments in line
>>
>>
>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>> Mahadev,
>>>
>>> Some quick questions:
>>>
>>> 1. Version
>>>
>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
>>> this release 3.2.1?
>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag
>> the release.
>>
>>> 2. Build targets
>>>
>>> The package target fails b/c the create-cppunit-configure target fails
>>> due to various problems w/ respect to autoconf. Are these dependencies
>>> documented somewhere ? I'd like to have a fully building system.
>>>
>>> create-cppunit-configure:
>>>      [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
>>>      [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
>>>      [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
>>>      [exec]       If this token and others are legitimate, please use m4_pattern_allow.
>>>      [exec]       See the Autoconf documentation.
>>>      [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
>>>
>> You need the autotools to run this. Please read the README for building the C
>> client library at src/c/ for the installation requirements.
>>> 3. Sync failure:
>>>
>>> This is still failing.
>>>
>>> svn: URL
>>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>> doesn't exist
>>>
>> Yes this hasn't been fixed yet!
>>
>> Thanks
>> mahadev
>>> -Todd
>>>
>>>> -----Original Message-----
>>>> From: Todd Greenwood
>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>
>>>> Great news. Thank you Mahadev. I'll report our findings later today.
>>>> -Todd
>>>>
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>
>>>>> Hi Todd,
>>>>>  I just committed 480 and 491. You can check out the 3.2 branch now.
>>>>> Thanks
>>>>> mahadev
>>>>>
>>>>>
>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>>>>> That'd be perfect. Thanks!
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>
>>>>>>> Hi Todd,
>>>>>>>   Most of the patches that you mention should be in the branch 3.2 by
>>>>>>> tomorrow or so. 481 and 479 are already in. 480 and 491 should be in by
>>>>>>> tomorrow. Would that suffice for you?
>>>>>>>
>>>>>>> Thanks
>>>>>>> mahadev
>>>>>>>
>>>>>>>
>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>>>>>>> Another problem...I've reverted to the latest versions of the patches
>>>>>>>> that are not specific to branch-3.2, and I'm getting two compilation
>>>>>>>> errors:
>>>>>>>>
>>>>>>>> build-generated:
>>>>>>>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>
>>>>>>>> compile-main:
>>>>>>>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>     [javac]                         ^
>>>>>>>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>     [javac]                       ^
>>>>>>>>     [javac] 2 errors
>>>>>>>>
>>>>>>>> My build process is pretty simple:
>>>>>>>>
>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>> (src/patched/branch-3.2)
>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>
>>>>>>>>> Flavio,
>>>>>>>>> I notice that you've updated the patches referenced for the WAN
>>>>>>>>> deployment. There appears to be an order dependency w/ respect to these
>>>>>>>>> four patches...
>>>>>>>>>
>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>
>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>
>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
>>>>>>>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>>>>>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
>>>>>>>>>
>>>>>>>>> Could you advise as to which patches I need to apply, and in what
>>>>>>>>> order?
>>>>>>>>> -Todd
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>
>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>
>>>>>>>>>> -Flavio
>>>>>>>>>>
>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches 473,
>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>
>>>>>>>>>>> -Todd
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
>>>>>>>>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
>>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>   [javac] ^
>>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see a reference to getWeight in both FastLeaderElection.java in
>>>>>>>>>>>>> patch 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I don't see a reference to this method in patches 473, 479, or
>>>>>>>>>>>>> 481. I also don't see a reference to this method in the trunk...
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>>>>>>>>>>>> 481).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be the
>>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to be and
>>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again, and it loops
>>>>>>>>>>>>>> over and over.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>> Server config
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>> different
>>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>> machines
>>>>>>>>> in
>>>>>>>>>>>>> dc1
>>>>>>>>>>>>>> have voting rights, and the ability to become a leader.
>>> The
>>>>>>>>>>>>> machines
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the pods all have a weight of zero, and are not expected
>>> to
>>>>>>>>>>> become
>>>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Todd
> 
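
For readers following the group/weight config quoted above, here is a rough Python sketch of the quorum rule as the ZooKeeper documentation describes it: a set of servers forms a quorum when it holds a majority of the weight in a majority of the groups whose total weight is non-zero. This is an illustration only, not the QuorumHierarchical code; the names GROUPS, WEIGHTS and contains_quorum are made up for the example, and the values are copied from the config above.

```python
# Illustrative sketch of a hierarchical (group/weight) quorum check.
# Not ZooKeeper's implementation; names here are hypothetical.

# Values copied from the server config quoted in the thread.
GROUPS = {1: [1, 2, 3, 4, 5], 2: [6, 7, 8, 9]}
WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}


def contains_quorum(acks, groups=GROUPS, weights=WEIGHTS):
    """Return True if the server ids in `acks` hold a weight majority
    in a majority of the groups that have non-zero total weight."""
    counted = 0  # groups with non-zero total weight
    won = 0      # counted groups in which `acks` holds a weight majority
    for members in groups.values():
        total = sum(weights[sid] for sid in members)
        if total == 0:
            continue  # an all-zero group never contributes to a quorum
        counted += 1
        acked = sum(weights[sid] for sid in members if sid in acks)
        if acked > total / 2:
            won += 1
    return won > counted / 2


print(contains_quorum({1, 2, 3}))     # three of the five dc1 voters
print(contains_quorum({6, 7, 8, 9}))  # pod servers only
```

Under this rule the pod servers (6 to 9) can never form a quorum on their own, and any three of the five dc1 servers can, which matches the intent Todd describes.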

RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Will do.

> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Tuesday, August 04, 2009 1:34 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> It would be better to create a JIRA with configs as well as logs.
> 
> Patrick
> 
> Mahadev Konar wrote:
> > Hi Todd,
> >
> >   What is the synclimit you are using? Can you post your config? For WAN's
> > you will have to use much bigger values for synclimit and others.
> >
> > Thanks
> > mahadev
> >
> >
> > On 8/4/09 1:24 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >
> >> Mahadev,
> >>
> >> I just heard from IT that this build behaves in exactly the same way as
> >> previous versions, e.g. we get continuous leader elections that
> >> disconnect the followers and then get re-elected, and disconnect...etc.
> >>
> >> This is from a fresh sync to the 3.2 branch:
> >>
> >> svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
> >>
> >> CHANGES.TXT show the various fixes included:
> >>
> >> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
> >> Release 3.2.1
> >>
> >> Backward compatibile changes:
> >>
> >> BUGFIXES:
> >>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
> >>
> >>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
> >>
> >>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> >>
> >>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
> >>
> >>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
> >>   (giri via mahadev)
> >>
> >>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> >>
> >>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
> >>   failure. (chris via mahadev)
> >>
> >>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
> >>
> >>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
> >>   embedded clients (ryan rawson via phunt)
> >>
> >>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
> >>
> >>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
> >>   (flavio via mahadev)
> >>
> >>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
> >>   (Chris Darroch via phunt)
> >>
> >>   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
> >>   add vote of follower (flavio via mahadev)
> >>
> >>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
> >>   mahadev)
> >>
> >> What can I do to assist you with this issue?
> >>
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>> Sent: Tuesday, August 04, 2009 12:43 PM
> >>> To: zookeeper-dev@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Hi todd,
> >>>  comments in line
> >>>
> >>>
> >>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>>> Mahadev,
> >>>>
> >>>> Some quick questions:
> >>>>
> >>>> 1. Version
> >>>>
> >>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> >>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
> >>>> this release 3.2.1?
> >>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag the
> >>> release.
> >>>
> >>>> 2. Build targets
> >>>>
> >>>> The package target fails b/c the create-cppunit-configure target fails
> >>>> due to various problems w/ respect to autoconf. Are these dependencies
> >>>> documented somewhere ? I'd like to have a fully building system.
> >>>>
> >>>> create-cppunit-configure:
> >>>>      [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
> >>>>      [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
> >>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
> >>>>      [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
> >>>>      [exec]       If this token and others are legitimate, please use m4_pattern_allow.
> >>>>      [exec]       See the Autoconf documentation.
> >>>>      [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
> >>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >>>>
> >>> You need auto tools to run this. Please read the README for building c
> >>> client library at src/c/ for the installation requirements.
> >>>> 3. Sync failure:
> >>>>
> >>>> This is still failing.
> >>>>
> >>>> svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
> >>>>
> >>> Yes this hasn't been fixed yet!
> >>>
> >>> Thanks
> >>> mahadev
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Todd Greenwood
> >>>>> Sent: Tuesday, August 04, 2009 11:26 AM
> >>>>> To: 'zookeeper-user@hadoop.apache.org'
> >>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Great news. Thank you Mahadev. I'll report our findings later
> >> today.
> >>>>> -Todd
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>
> >>>>>> Hi Todd,
> >>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
> >> now.
> >>>>>> Thanks
> >>>>>> mahadev
> >>>>>>
> >>>>>>
> >>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
> >>>> wrote:
> >>>>>>> That'd be perfect. Thanks!
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> Hi Todd,
> >>>>>>>>   Most of the patches that you mention should be in the
branch
> >>>> 3.2 by
> >>>>>>> tomm
> >>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
> >> tomm.
> >>>>>>> Would
> >>>>>>>> that
> >>>>>>>> suffice for you?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> mahadev
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood"
<to...@audiencescience.com>
> >>>>> wrote:
> >>>>>>>>> Another problem...I've reverted to the latest versions of
the
> >>>>>>> patches
> >>>>>>>>> that are not specific to branch-3.2, and I'm getting two
> >>>> compilation
> >>>>>>>>> errors:
> >>>>>>>>>
> >>>>>>>>> build-generated:
> >>>>>>>>>     [javac] Compiling 44 source files to
> >>>>>>>>>
> >>
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>>>>> atched/branch-3.2/build/classes
> >>>>>>>>>
> >>>>>>>>> compile-main:
> >>>>>>>>>     [javac] Compiling 2 source files to
> >>>>>>>>>
> >>
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>>>>> atched/branch-3.2/build/classes
> >>>>>>>>>     [javac]
> >>>>>>>>>
> >>
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>>> atched/branch-
> >>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> >>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
> >>>> getQuorumPeers()
> >>>>>>> have
> >>>>>>>>> the same erasure
> >>>>>>>>>     [javac]         public String[] getQuorumPeers();
> >>>>>>>>>     [javac]                         ^
> >>>>>>>>>     [javac]
> >>>>>>>>>
> >>
>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>>> atched/branch-
> >>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> >>>>>>>>> mStats.java:31: name clash: getServerState() and
> >>>> getServerState()
> >>>>>>> have
> >>>>>>>>> the same erasure
> >>>>>>>>>     [javac]         public String getServerState();
> >>>>>>>>>     [javac]                       ^
> >>>>>>>>>     [javac] 2 errors
> >>>>>>>>>
> >>>>>>>>> My build process is pretty simple:
> >>>>>>>>>
> >>>>>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>>>>> (src/patched/branch-3.2)
> >>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>>>>> 3. build zookeeper in the temp directory
> >>>>>>>>>
> >>>>>>>>> -Todd
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> Flavio,
> >>>>>>>>>> I notice that you've updated the patches referenced for the
> >> WAN
> >>>>>>>>>> deployment. There appears to be an order dependency w/
respect
> >>>> to
> >>>>>>>>> these
> >>>>>>>>>> four patches...
> >>>>>>>>>>
> >>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>>>>>>>>>
> >>>>>>>>>> 473 -> 479 (479 fails)
> >>>>>>>>>>
> >>>>>>>>>>
> >>
>
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> >>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
> >>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>>>>> patching file
> >>>>>>>>>>
> >>
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
> >>>>>>>>>> ical.java
> >>>>>>>>>> patching file
> >>>>>>>>>>
> >>
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>>>>> patching file
> >>>>>>>>>>
> >>
>
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
> >>>>>>>>>> .java
> >>>>>>>>>> patching file
> >>>>>>>>>>
> >>>>
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>>>>> Hunk #1 FAILED at 93.
> >>>>>>>>>> Hunk #2 FAILED at 145.
> >>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
> >>>>>>>>>>
> >>
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>
>
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> >>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
> >>>>>>>>>>
> >>>>>>>>>> Could you advise as to which patches I need to apply, and
in
> >>>> what
> >>>>>>>>> order?
> >>>>>>>>>> -Todd
> >>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>
> >>>>>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>>>>
> >>>>>>>>>>> -Flavio
> >>>>>>>>>>>
> >>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
> >>>> Compilation
> >>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest
> >>>> patches
> >>>>>>>>>> 473,
> >>>>>>>>>>>> 479, 481, and 491.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of
> >> the
> >>>>>>>>> patch.
> >>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Flavio,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> compile-main:
> >>>>>>>>>>>>>   [javac] Compiling 1 source file to
> >>>>>>>>>>>>>
> >>>>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>>>>>>>>> src/p
> >>>>>>>>>>>>> atched/branch-3.2/build/classes
> >>>>>>>>>>>>>   [javac]
> >>>>>>>>>>>>>
> >>>>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>>>>>>>>> src/p
> >>>>>>>>>>>>>
> >>>>
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>>>>>>>>> FastL
> >>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
> >>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>>>>>>>>   [javac] location: interface
> >>>>>>>>>>>>>
org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>>>   [javac]
> >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>>   [javac]
> >>>> ^
> >>>>>>>>>>>>>   [javac] 1 error
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I see a reference to getWeight in both
> >>>> FastLeaderElection.java
> >>>>>>>>> in
> >>>>>>>>>>>>> patch
> >>>>>>>>>>>>> 491:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
> >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>>>>>>>>> FastLeaderElection.java
> >>>>>>>>>>>>> :
> >>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> >>>>>>>>>>>>> 0)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> However, I don't see a reference to this method in
patches
> >>>> 473,
> >>>>>>>>>> 479,
> >>>>>>>>>>>>> or
> >>>>>>>>>>>>> 481. I also don't see a reference to this method in the
> >>>>>>> trunk...
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This repro's in both branch-3.2, and
> >>>> branch-3.2+patches(473,
> >>>>>>>>>> 479,
> >>>>>>>>>>>>> 481).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Basically, it seems like the nodes are electing
> >>>> pd4-zook02
> >>>>>>> to
> >>>>>>>>>> be
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> >>>>>>>>> supposed
> >>>>>>>>>> to
> >>>>>>>>>>>>> be
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
> >>>> and
> >>>>>>>>> it
> >>>>>>>>>>>>> loops
> >>>>>>>>>>>>> over and over.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------
> >>>>>>>>>>>>> Server config
> >>>>>>>>>>>>> -------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
> >>>>>>>>> different
> >>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
> >>>>>>> machines
> >>>>>>>>>> in
> >>>>>>>>>>>>> dc1
> >>>>>>>>>>>>> have voting rights, and the ability to become a leader.
> >>>> The
> >>>>>>>>>>>>> machines
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>> the pods all have a weight of zero, and are not expected
> >>>> to
> >>>>>>>>>>>> become
> >>>>>>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Todd
> >>>>>>>>>>>>>
> >
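
A note on why ZOOKEEPER-491 ("Prevent zero-weight servers from being elected") matters for the config above. The sketch below is an assumption-laden illustration, not the FastLeaderElection code and not the patch itself: it assumes votes are compared by (zxid, server id), with the higher pair winning, and adds a weight check so that a zero-weight server never displaces the current candidate. The names WEIGHTS and better_candidate are hypothetical; the weights are copied from the quoted config.

```python
# Illustrative only: a vote comparison that refuses zero-weight candidates.
# Assumes candidates are ordered by (zxid, sid), higher wins. This is a
# simplification and not the actual FastLeaderElection logic.

WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}


def better_candidate(new, cur, weights=WEIGHTS):
    """`new` and `cur` are (zxid, sid) pairs. Return True if `new`
    should replace `cur` as the preferred leader candidate."""
    new_zxid, new_sid = new
    if weights.get(new_sid, 0) == 0:
        return False  # a zero-weight server is never preferred
    # Python compares tuples element by element: zxid first, then sid.
    return (new_zxid, new_sid) > tuple(cur)


# Same zxid everywhere: without the weight check server 9 would win on id.
print(better_candidate((10, 9), (10, 5)))
print(better_candidate((10, 5), (10, 3)))
```

If ties are indeed broken by server id, then without such a check the highest id (server.9, i.e. pd4-zook02) would win whenever zxids are equal, which would be consistent with the behaviour Todd reports. That is an inference from the config, not something confirmed in the thread.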

Re: Unending Leader Elections in WAN deploy

Posted by Patrick Hunt <ph...@apache.org>.
It would be better to create a JIRA with configs as well as logs.

Patrick

Mahadev Konar wrote:
> Hi Todd,
> 
>   What is the synclimit you are using? Can you post your config? For WAN's
> you will have to use much bigger values for synclimit and others.
> 
> Thanks
> mahadev
> 
> 
> On 8/4/09 1:24 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
>> Mahadev,
>>
>> I just heard from IT that this build behaves in exactly the same way as
>> previous versions, e.g. we get continuous leader elections that
>> disconnect the followers and then get re-elected, and disconnect...etc.
>>
>> This is from a fresh sync to the 3.2 branch:
>>
>> svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 ./branch-3.2
>>
>> CHANGES.TXT show the various fixes included:
>>
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
>> Release 3.2.1
>>
>> Backward compatibile changes:
>>
>> BUGFIXES:
>>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)
>>
>>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)
>>
>>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
>>
>>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)
>>
>>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>>   (giri via mahadev)
>>
>>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
>>
>>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
>>   failure. (chris via mahadev)
>>
>>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)
>>
>>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
>>   embedded clients (ryan rawson via phunt)
>>
>>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
>>
>>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
>>   (flavio via mahadev)
>>
>>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
>>   (Chris Darroch via phunt)
>>
>>   ZOOKEEPER-480. FLE should perform leader check when node is not leading and
>>   add vote of follower (flavio via mahadev)
>>
>>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
>>   mahadev)
>>
>> What can I do to assist you with this issue?
>>
>> -Todd
>>
>>> -----Original Message-----
>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>> Sent: Tuesday, August 04, 2009 12:43 PM
>>> To: zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>
>>> Hi todd,
>>>  comments in line
>>>
>>>
>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>>>> Mahadev,
>>>>
>>>> Some quick questions:
>>>>
>>>> 1. Version
>>>>
>>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
>>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
>>>> this release 3.2.1?
>>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag the
>>> release.
>>>
>>>> 2. Build targets
>>>>
>>>> The package target fails b/c the create-cppunit-configure target fails
>>>> due to various problems w/ respect to autoconf. Are these dependencies
>>>> documented somewhere ? I'd like to have a fully building system.
>>>>
>>>> create-cppunit-configure:
>>>>      [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
>>>>      [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
>>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
>>>>      [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
>>>>      [exec]       If this token and others are legitimate, please use m4_pattern_allow.
>>>>      [exec]       See the Autoconf documentation.
>>>>      [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
>>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
>>>>
>>> You need auto tools to run this. Please read the README for building c
>>> client library at src/c/ for the installation requirements.
>>>> 3. Sync failure:
>>>>
>>>> This is still failing.
>>>>
>>>> svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
>>>>
>>> Yes this hasn't been fixed yet!
>>>
>>> Thanks
>>> mahadev
>>>> -Todd
>>>>
>>>>> -----Original Message-----
>>>>> From: Todd Greenwood
>>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>
>>>>> Great news. Thank you Mahadev. I'll report our findings later
>> today.
>>>>> -Todd
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>
>>>>>> Hi Todd,
>>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
>> now.
>>>>>> Thanks
>>>>>> mahadev
>>>>>>
>>>>>>
>>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>>> That'd be perfect. Thanks!
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>
>>>>>>>> Hi Todd,
>>>>>>>>   Most of the patches that you mention should be in the branch
>>>> 3.2 by
>>>>>>> tomm
>>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
>> tomm.
>>>>>>> Would
>>>>>>>> that
>>>>>>>> suffice for you?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> mahadev
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>>> wrote:
>>>>>>>>> Another problem...I've reverted to the latest versions of the
>>>>>>> patches
>>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>>> compilation
>>>>>>>>> errors:
>>>>>>>>>
>>>>>>>>> build-generated:
>>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>
>>>>>>>>> compile-main:
>>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>     [javac]
>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>> atched/branch-
>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>>> getQuorumPeers()
>>>>>>> have
>>>>>>>>> the same erasure
>>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>>     [javac]                         ^
>>>>>>>>>     [javac]
>>>>>>>>>
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>> atched/branch-
>>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>>> mStats.java:31: name clash: getServerState() and
>>>> getServerState()
>>>>>>> have
>>>>>>>>> the same erasure
>>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>>     [javac]                       ^
>>>>>>>>>     [javac] 2 errors
>>>>>>>>>
>>>>>>>>> My build process is pretty simple:
>>>>>>>>>
>>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>>> (src/patched/branch-3.2)
>>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>>>
>>>>>>>>> -Todd
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>
>>>>>>>>>> Flavio,
>>>>>>>>>> I notice that you've updated the patches referenced for the
>> WAN
>>>>>>>>>> deployment. There appears to be an order dependency w/ respect
>>>> to
>>>>>>>>> these
>>>>>>>>>> four patches...
>>>>>>>>>>
>>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>>>
>>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>>>
>>>>>>>>>>
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>>> patching file
>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>>> ical.java
>>>>>>>>>> patching file
>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>>> patching file
>>>>>>>>>>
>> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>>> .java
>>>>>>>>>> patching file
>>>>>>>>>>
>>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>>>
>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>>>
>>>>>>>>>> Could you advise as to which patches I need to apply, and in
>>>> what
>>>>>>>>> order?
>>>>>>>>>> -Todd
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>
>>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>>>
>>>>>>>>>>> -Flavio
>>>>>>>>>>>
>>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>>> Compilation
>>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest
>>>> patches
>>>>>>>>>> 473,
>>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>>>
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of
>> the
>>>>>>>>> patch.
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Flavio,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>>>
>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>> src/p
>>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>>
>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>>> src/p
>>>>>>>>>>>>>
>>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>> FastL
>>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>>   [javac]
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>>   [javac]
>>>> ^
>>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see a reference to getWeight in both
>>>> FastLeaderElection.java
>>>>>>>>> in
>>>>>>>>>>>>> patch
>>>>>>>>>>>>> 491:
>>>>>>>>>>>>>
>>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>>> :
>>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>>> 0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I don't see a reference to this method in patches
>>>> 473,
>>>>>>>>>> 479,
>>>>>>>>>>>>> or
>>>>>>>>>>>>> 481. I also don't see a reference to this method in the
>>>>>>> trunk...
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> This repro's in both branch-3.2, and
>>>> branch-3.2+patches(473,
>>>>>>>>>> 479,
>>>>>>>>>>>>> 481).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>>> pd4-zook02
>>>>>>> to
>>>>>>>>>> be
>>>>>>>>>>>>> the
>>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>>> supposed
>>>>>>>>>> to
>>>>>>>>>>>>> be
>>>>>>>>>>>>> and
>>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
>>>> and
>>>>>>>>> it
>>>>>>>>>>>>> loops
>>>>>>>>>>>>> over and over.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>> Server config
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>>>
>>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>>>
>>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>>> different
>>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>>> machines
>>>>>>>>>> in
>>>>>>>>>>>>> dc1
>>>>>>>>>>>>> have voting rights, and the ability to become a leader.
>>>> The
>>>>>>>>>>>>> machines
>>>>>>>>>>>>> in
>>>>>>>>>>>>> the pods all have a weight of zero, and are not expected
>>>> to
>>>>>>>>>>>> become
>>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
> 
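
The group/weight configuration quoted above is meant to give only the dc1 servers a vote. As a rough illustration of how such a weighted, grouped quorum check can work, here is a simplified sketch. It is not the ZooKeeper implementation: the function name and data layout are hypothetical, and it assumes that a group whose total weight is zero is left out of the group count (the behaviour the ZOOKEEPER-479 entry appears to be about).

```python
# Simplified sketch of a hierarchical (grouped, weighted) quorum check.
# Hypothetical names; not taken from the ZooKeeper source.

# Mirrors the quoted config: group 1 = servers 1-5 (weight 1),
# group 2 = servers 6-9 (weight 0).
GROUPS = {1: [1, 2, 3, 4, 5], 2: [6, 7, 8, 9]}
WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}


def contains_quorum(acks, groups=GROUPS, weights=WEIGHTS):
    """Return True if the acknowledging servers form a quorum.

    A group is "won" when the acking members hold more than half of the
    group's total weight. A quorum needs more than half of the groups
    that have non-zero total weight (an assumption of this sketch).
    """
    active_groups = 0
    won_groups = 0
    for members in groups.values():
        total = sum(weights[sid] for sid in members)
        if total == 0:
            continue  # zero-weight group: ignored entirely
        active_groups += 1
        acked = sum(weights[sid] for sid in members if sid in acks)
        if 2 * acked > total:
            won_groups += 1
    return 2 * won_groups > active_groups


if __name__ == "__main__":
    print(contains_quorum({1, 2, 3}))           # three dc1 servers
    print(contains_quorum({1, 2, 6, 7, 8, 9}))  # two dc1 servers plus all pods
```

Under these assumptions, three dc1 servers are enough for a quorum, while every pod server together with only two dc1 servers is not.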

Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,

  What syncLimit are you using? Can you post your config? For WANs
you will have to use much larger values for syncLimit and the other timing
settings.

Thanks
mahadev
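
For context, initLimit and syncLimit are expressed in ticks, so the wall-clock time a follower is allowed depends on tickTime as well. The small sketch below shows the arithmetic. The helper name is hypothetical and the numbers are illustrative only, not recommended WAN settings.

```python
# Sketch: convert tick-based limits from zoo.cfg into milliseconds.
# Helper name and example values are illustrative assumptions.

def quorum_timeouts_ms(tick_time_ms, init_limit, sync_limit):
    """Return the effective follower timeouts in milliseconds."""
    return {
        "init": tick_time_ms * init_limit,  # time to connect and sync to a leader
        "sync": tick_time_ms * sync_limit,  # how far a follower may lag behind
    }


if __name__ == "__main__":
    # Values often seen in sample configs for a LAN.
    print(quorum_timeouts_ms(2000, 10, 5))
    # Hypothetical larger limits for a high-latency WAN link.
    print(quorum_timeouts_ms(2000, 30, 15))
```

If the sync timeout is shorter than the time a remote follower needs to catch up, the follower can be dropped repeatedly, which is one reason to ask about these values first.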


On 8/4/09 1:24 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Mahadev,
> 
> I just heard from IT that this build behaves in exactly the same way as
> previous versions, e.g. we get continuous leader elections that
> disconnect the followers and then get re-elected, and disconnect...etc.
> 
> This is from a fresh sync to the 3.2 branch:
> 
> svn co
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
> ./branch-3.2
> 
> CHANGES.TXT show the various fixes included:
> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> /src/original$ head -n 50 branch-3.2/CHANGES.txt
> Release 3.2.1
> 
> Backward compatibile changes:
> 
> BUGFIXES:
>   ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via
> flavio)
> 
>   ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via
> mahadev)
> 
>   ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)
> 
>   ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via
> mahadev)
> 
>   ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
>   (giri via mahadev)
>   
>   ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)
> 
>   ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
>   failure. (chris via mahadev)
> 
>   ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via
> phunt)
> 
>   ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and
> other)
>   embedded clients (ryan rawson via phunt)
> 
>   ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via
> mahadev)
> 
>   ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
>   (flavio via mahadev)
> 
>   ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty
> cert
>   (Chris Darroch via phunt)
> 
>   ZOOKEEPER-480. FLE should perform leader check when node is not
> leading and
>   add vote of follower (flavio via mahadev)
> 
>   ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio
> via
>   mahadev)
> 
> What can I do to assist you with this issue?
> 
> -Todd
> 
>> -----Original Message-----
>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>> Sent: Tuesday, August 04, 2009 12:43 PM
>> To: zookeeper-dev@hadoop.apache.org
>> Subject: Re: Unending Leader Elections in WAN deploy
>> 
>> Hi todd,
>>  comments in line
>> 
>> 
>>> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>> 
>>> Mahadev,
>>> 
>>> Some quick questions:
>>> 
>>> 1. Version
>>> 
>>> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
>>> calling this 3.2.0. Should this be rev'd, and am I correct in calling
>>> this release 3.2.1?
>> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag the
>> release.
>> 
>>> 
>>> 2. Build targets
>>> 
>>> The package target fails b/c the create-cppunit-configure target fails
>>> due to various problems w/ respect to autoconf. Are these dependencies
>>> documented somewhere ? I'd like to have a fully building system.
>>> 
>>> create-cppunit-configure:
>>>      [exec] Can't exec "libtoolize": No such file or directory at
>>> /usr/bin/autoreconf line 188.
>>>      [exec] Use of uninitialized value $libtoolize in pattern match
>>> (m//) at /usr/bin/autoreconf line 188.
>>>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not
> found
>>> in library
>>>      [exec] configure.ac:33: error: possibly undefined macro:
>>> AM_PATH_CPPUNIT
>>>      [exec]       If this token and others are legitimate, please
> use
>>> m4_pattern_allow.
>>>      [exec]       See the Autoconf documentation.
>>>      [exec] configure.ac:53: error: possibly undefined macro:
>>> AC_PROG_LIBTOOL
>>>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
>>> 
>> You need auto tools to run this. Please read the README for building c
>> client library at src/c/ for the installation requirements.
>>> 
>>> 3. Sync failure:
>>> 
>>> This is still failing.
>>> 
>>> svn: URL
>>> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
>>> doesn't exist
>>> 
>> 
>> Yes this hasn't been fixed yet!
>> 
>> Thanks
>> mahadev
>>> -Todd
>>> 
>>>> -----Original Message-----
>>>> From: Todd Greenwood
>>>> Sent: Tuesday, August 04, 2009 11:26 AM
>>>> To: 'zookeeper-user@hadoop.apache.org'
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>> 
>>>> Great news. Thank you Mahadev. I'll report our findings later
> today.
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>> Sent: Tuesday, August 04, 2009 11:20 AM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Hi Todd,
>>>>>  I just committed 480 and 491. You can checkout the 3.2 branch
> now.
>>>>> 
>>>>> Thanks
>>>>> mahadev
>>>>> 
>>>>> 
>>>>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
>>> wrote:
>>>>> 
>>>>>> That'd be perfect. Thanks!
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>>>>>>> Sent: Monday, August 03, 2009 4:24 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Hi Todd,
>>>>>>>   Most of the patches that you mention should be in the branch
>>> 3.2 by
>>>>>> tomm
>>>>>>> or so. 481, 479 are already in. 480 and 491 should be in by
> tomm.
>>>>>> Would
>>>>>>> that
>>>>>>> suffice for you?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> mahadev
>>>>>>> 
>>>>>>> 
>>>>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com>
>>>> wrote:
>>>>>>> 
>>>>>>>> Another problem...I've reverted to the latest versions of the
>>>>>> patches
>>>>>>>> that are not specific to branch-3.2, and I'm getting two
>>> compilation
>>>>>>>> errors:
>>>>>>>> 
>>>>>>>> build-generated:
>>>>>>>>     [javac] Compiling 44 source files to
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>> 
>>>>>>>> compile-main:
>>>>>>>>     [javac] Compiling 2 source files to
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>     [javac]
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> 
>>>>>> atched/branch-
>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>> mStats.java:30: name clash: getQuorumPeers() and
>>> getQuorumPeers()
>>>>>> have
>>>>>>>> the same erasure
>>>>>>>>     [javac]         public String[] getQuorumPeers();
>>>>>>>>     [javac]                         ^
>>>>>>>>     [javac]
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>>>>>>> 
>>>>>> atched/branch-
>>>> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>>>>>>> mStats.java:31: name clash: getServerState() and
>>> getServerState()
>>>>>> have
>>>>>>>> the same erasure
>>>>>>>>     [javac]         public String getServerState();
>>>>>>>>     [javac]                       ^
>>>>>>>>     [javac] 2 errors
>>>>>>>> 
>>>>>>>> My build process is pretty simple:
>>>>>>>> 
>>>>>>>> 1. copy the branch-3.2 source to a temp directory
>>>>>>>> (src/patched/branch-3.2)
>>>>>>>> 2. apply the ZOOKEEPER patches in my patches directory
>>>>>>>> 3. build zookeeper in the temp directory
>>>>>>>> 
>>>>>>>> -Todd
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>> 
>>>>>>>>> Flavio,
>>>>>>>>> I notice that you've updated the patches referenced for the
> WAN
>>>>>>>>> deployment. There appears to be an order dependency w/ respect
>>> to
>>>>>>>> these
>>>>>>>>> four patches...
>>>>>>>>> 
>>>>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>>>>>>> 
>>>>>>>>> 473 -> 479 (479 fails)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>> /src/patched/branch-3.2$ patch -p0 <
>>>>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>>>>>>> patching file
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>>>>>>> ical.java
>>>>>>>>> patching file
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>>>>>>> patching file
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>>>>>>> .java
>>>>>>>>> patching file
>>>>>>>>> 
>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>>>>>>> Hunk #1 FAILED at 93.
>>>>>>>>> Hunk #2 FAILED at 145.
>>>>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>>>>>>> /src/patched/branch-3.2$ h ../patches/
>>>>>>>>> 
>>>>>>>>> Could you advise as to which patches I need to apply, and in
>>> what
>>>>>>>> order?
>>>>>>>>> 
>>>>>>>>> -Todd
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> Perfect! Thanks for the update, Todd.
>>>>>>>>>> 
>>>>>>>>>> -Flavio
>>>>>>>>>> 
>>>>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks. You were right, I had a stale version of 479.
>>> Compilation
>>>>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest
>>> patches
>>>>>>>>> 473,
>>>>>>>>>>> 479, 481, and 491.
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> It should be in 479. Perhaps you have a stale version of
> the
>>>>>>>> patch.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Flavio,
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> compile-main:
>>>>>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>>>>>> 
>>>>>>>>> 
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> 
>>>>>>>>> 
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>>>>>> src/p
>>>>>>>>>>>> 
>>>>>>>>> 
>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastL
>>>>>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>>>>>   [javac] location: interface
>>>>>>>>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>>>>>   [javac]
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>>   [javac]
>>> ^
>>>>>>>>>>>>   [javac] 1 error
>>>>>>>>>>>> 
>>>>>>>>>>>> I see a reference to getWeight in both
>>> FastLeaderElection.java
>>>>>>>> in
>>>>>>>>>>>> patch
>>>>>>>>>>>> 491:
>>>>>>>>>>>> 
>>>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>>>>>> FastLeaderElection.java
>>>>>>>>>>>> :
>>>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>>>>>> 0)
>>>>>>>>>>>> 
>>>>>>>>>>>> However, I don't see a reference to this method in patches
>>> 473,
>>>>>>>>> 479,
>>>>>>>>>>>> or
>>>>>>>>>>>> 481. I also don't see a reference to this method in the
>>>>>> trunk...
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>>>> 
>>>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This repro's in both branch-3.2, and
>>> branch-3.2+patches(473,
>>>>>>>>> 479,
>>>>>>>>>>>> 481).
>>>>>>>>>>>> 
>>>>>>>>>>>> Basically, it seems like the nodes are electing
>>> pd4-zook02
>>>>>> to
>>>>>>>>> be
>>>>>>>>>>>> the
>>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>>>>>>> supposed
>>>>>>>>> to
>>>>>>>>>>>> be
>>>>>>>>>>>> and
>>>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
>>> and
>>>>>>>> it
>>>>>>>>>>>> loops
>>>>>>>>>>>> over and over.
>>>>>>>>>>>> 
>>>>>>>>>>>> -------------
>>>>>>>>>>>> Server config
>>>>>>>>>>>> -------------
>>>>>>>>>>>> 
>>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>>> 
>>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>>> weight.1=1
>>>>>>>>>>>> weight.2=1
>>>>>>>>>>>> weight.3=1
>>>>>>>>>>>> weight.4=1
>>>>>>>>>>>> weight.5=1
>>>>>>>>>>>> 
>>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>>> weight.6=0
>>>>>>>>>>>> weight.7=0
>>>>>>>>>>>> weight.8=0
>>>>>>>>>>>> weight.9=0
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>>>>>>> different
>>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
>>>>>> machines
>>>>>>>>> in
>>>>>>>>>>>> dc1
>>>>>>>>>>>> have voting rights, and the ability to become a leader.
>>> The
>>>>>>>>>>>> machines
>>>>>>>>>>>> in
>>>>>>>>>>>> the pods all have a weight of zero, and are not expected
>>> to
>>>>>>>>>>> become
>>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Todd
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>> 
> 


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Mahadev,

I just heard from IT that this build behaves exactly the same way as
previous versions, i.e. we get continuous leader elections: the leader is
elected, disconnects the followers, gets re-elected, disconnects them
again, and so on.

This is from a fresh sync to the 3.2 branch:

svn co
http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
./branch-3.2

CHANGES.txt shows the various fixes included:

toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/original$ head -n 50 branch-3.2/CHANGES.txt
Release 3.2.1

Backward compatibile changes:

BUGFIXES:
  ZOOKEEPER-468. avoid compile warning in send_auth_info(). (chris via flavio)

  ZOOKEEPER-469. make sure CPPUNIT_CFLAGS isn't overwritten (chris via mahadev)

  ZOOKEEPER-471. update zkperl for 3.2.x branch. (chris via mahadev)

  ZOOKEEPER-470. include unistd.h for sleep() in c tests (chris via mahadev)

  ZOOKEEPER-460. bad testRetry in cppunit tests (hudson failure)
  (giri via mahadev)

  ZOOKEEPER-467.  Change log level in BookieHandle (flavio via mahadev)

  ZOOKEEPER-482. ignore sigpipe in testRetry to avoid silent immediate
  failure. (chris via mahadev)

  ZOOKEEPER-487. setdata on root (/) crashes the servers (mahadev via phunt)

  ZOOKEEPER-457. Make ZookeeperMain public, support for HBase (and other)
  embedded clients (ryan rawson via phunt)

  ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)

  ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
  (flavio via mahadev)

  ZOOKEEPER-466. crash on zookeeper_close() when using auth with empty cert
  (Chris Darroch via phunt)

  ZOOKEEPER-480. FLE should perform leader check when node is not leading and
  add vote of follower (flavio via mahadev)

  ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
  mahadev)

What can I do to assist you with this issue?

-Todd
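
One quick way to confirm which of the discussed patches a checkout already contains is to scan CHANGES.txt for their issue numbers. The sketch below assumes each entry starts with "ZOOKEEPER-<number>." as in the listing above. The function names are hypothetical and the sample text is abbreviated from that listing.

```python
import re

# Sketch: find which wanted ZOOKEEPER issues are absent from a CHANGES.txt text.
# Function names are hypothetical; the entry format is assumed from the listing.


def fixed_issues(changes_text):
    """Return the set of issue numbers that appear as 'ZOOKEEPER-<n>.' entries."""
    return {int(n) for n in re.findall(r"ZOOKEEPER-(\d+)\.", changes_text)}


def missing_patches(changes_text, wanted):
    """Return the wanted issue numbers that have no entry, sorted ascending."""
    return sorted(set(wanted) - fixed_issues(changes_text))


SAMPLE = """\
  ZOOKEEPER-481. Add lastMessageSent to QuorumCnxManager. (flavio via mahadev)
  ZOOKEEPER-479.  QuorumHierarchical does not count groups correctly
  ZOOKEEPER-480. FLE should perform leader check when node is not leading and
  ZOOKEEPER-491. Prevent zero-weight servers from being elected (flavio via
"""

if __name__ == "__main__":
    # Patches mentioned in this thread for the WAN deployment.
    print(missing_patches(SAMPLE, [473, 479, 480, 481, 491]))
```

Against the excerpt shown, 473 has no entry, though the excerpt covers only the first 50 lines of the file, so that alone does not show the patch is absent from the branch.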

> -----Original Message-----
> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> Sent: Tuesday, August 04, 2009 12:43 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Hi todd,
>  comments in line
> 
> 
> On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
> > Mahadev,
> >
> > Some quick questions:
> >
> > 1. Version
> >
> > I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> > calling this 3.2.0. Should this be rev'd, and am I correct in calling
> > this release 3.2.1?
> Yes the release is 3.2.1. The build.xml will be fixed as soon as we tag the
> release.
> 
> >
> > 2. Build targets
> >
> > The package target fails b/c the create-cppunit-configure target fails
> > due to various problems w/ respect to autoconf. Are these dependencies
> > documented somewhere ? I'd like to have a fully building system.
> >
> > create-cppunit-configure:
> >      [exec] Can't exec "libtoolize": No such file or directory at
> > /usr/bin/autoreconf line 188.
> >      [exec] Use of uninitialized value $libtoolize in pattern match
> > (m//) at /usr/bin/autoreconf line 188.
> >      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found
> > in library
> >      [exec] configure.ac:33: error: possibly undefined macro:
> > AM_PATH_CPPUNIT
> >      [exec]       If this token and others are legitimate, please use
> > m4_pattern_allow.
> >      [exec]       See the Autoconf documentation.
> >      [exec] configure.ac:53: error: possibly undefined macro:
> > AC_PROG_LIBTOOL
> >      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> >
> You need auto tools to run this. Please read the README for building c
> client library at src/c/ for the installation requirements.
> >
> > 3. Sync failure:
> >
> > This is still failing.
> >
> > svn: URL
> > 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> > doesn't exist
> >
> 
> Yes this hasn't been fixed yet!
> 
> Thanks
> mahadev
> > -Todd
> >
> >> -----Original Message-----
> >> From: Todd Greenwood
> >> Sent: Tuesday, August 04, 2009 11:26 AM
> >> To: 'zookeeper-user@hadoop.apache.org'
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Great news. Thank you Mahadev. I'll report our findings later today.
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>> Sent: Tuesday, August 04, 2009 11:20 AM
> >>> To: zookeeper-user@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Hi Todd,
> >>>  I just committed 480 and 491. You can checkout the 3.2 branch now.
> >>>
> >>> Thanks
> >>> mahadev
> >>>
> >>>
> >>> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com>
> > wrote:
> >>>
> >>>> That'd be perfect. Thanks!
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >>>>> Sent: Monday, August 03, 2009 4:24 PM
> >>>>> To: zookeeper-user@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Hi Todd,
> >>>>>   Most of the patches that you mention should be in the branch
> > 3.2 by
> >>>> tomm
> >>>>> or so. 481, 479 are already in. 480 and 491 should be in by
tomm.
> >>>> Would
> >>>>> that
> >>>>> suffice for you?
> >>>>>
> >>>>> Thanks
> >>>>> mahadev
> >>>>>
> >>>>>
> >>>>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com>
> >> wrote:
> >>>>>
> >>>>>> Another problem...I've reverted to the latest versions of the
> >>>> patches
> >>>>>> that are not specific to branch-3.2, and I'm getting two
> > compilation
> >>>>>> errors:
> >>>>>>
> >>>>>> build-generated:
> >>>>>>     [javac] Compiling 44 source files to
> >>>>>>
> >>>>
> >>
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>> atched/branch-3.2/build/classes
> >>>>>>
> >>>>>> compile-main:
> >>>>>>     [javac] Compiling 2 source files to
> >>>>>>
> >>>>
> >>
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>> atched/branch-3.2/build/classes
> >>>>>>     [javac]
> >>>>>>
> >>>>
> >>
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>>
> >>>> atched/branch-
> >> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> >>>>>> mStats.java:30: name clash: getQuorumPeers() and
> > getQuorumPeers()
> >>>> have
> >>>>>> the same erasure
> >>>>>>     [javac]         public String[] getQuorumPeers();
> >>>>>>     [javac]                         ^
> >>>>>>     [javac]
> >>>>>>
> >>>>
> >>
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >>>>>>
> >>>> atched/branch-
> >> 3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> >>>>>> mStats.java:31: name clash: getServerState() and
> > getServerState()
> >>>> have
> >>>>>> the same erasure
> >>>>>>     [javac]         public String getServerState();
> >>>>>>     [javac]                       ^
> >>>>>>     [javac] 2 errors
> >>>>>>
> >>>>>> My build process is pretty simple:
> >>>>>>
> >>>>>> 1. copy the branch-3.2 source to a temp directory
> >>>>>> (src/patched/branch-3.2)
> >>>>>> 2. apply the ZOOKEEPER patches in my patches directory
> >>>>>> 3. build zookeeper in the temp directory
> >>>>>>
> >>>>>> -Todd
> >>>>>>> -----Original Message-----
> >>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Flavio,
> >>>>>>> I notice that you've updated the patches referenced for the
WAN
> >>>>>>> deployment. There appears to be an order dependency w/ respect
> > to
> >>>>>> these
> >>>>>>> four patches...
> >>>>>>>
> >>>>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >>>>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>>>>>>
> >>>>>>> 473 -> 479 (479 fails)
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> >>>>>>> /src/patched/branch-3.2$ patch -p0 <
> >>>>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>>>>> patching file
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
> >>>>>>> ical.java
> >>>>>>> patching file
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>>>>> patching file
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
> >>>>>>> .java
> >>>>>>> patching file
> >>>>>>>
> > src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>>>>> Hunk #1 FAILED at 93.
> >>>>>>> Hunk #2 FAILED at 145.
> >>>>>>> 2 out of 2 hunks FAILED -- saving rejects to file
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> >>>>>>> /src/patched/branch-3.2$ h ../patches/
> >>>>>>>
> >>>>>>> Could you advise as to which patches I need to apply, and in
> > what
> >>>>>> order?
> >>>>>>>
> >>>>>>> -Todd
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> Perfect! Thanks for the update, Todd.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>>>>
> >>>>>>>>> Thanks. You were right, I had a stale version of 479.
> > Compilation
> >>>>>>>>> succeeds and all tests pass on branch-3.2 with the latest
> > patches
> >>>>>>> 473,
> >>>>>>>>> 479, 481, and 491.
> >>>>>>>>>
> >>>>>>>>> -Todd
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> It should be in 479. Perhaps you have a stale version of
the
> >>>>>> patch.
> >>>>>>>>>>
> >>>>>>>>>> -Flavio
> >>>>>>>>>>
> >>>>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Flavio,
> >>>>>>>>>>>
> >>>>>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>>>>
> >>>>>>>>>>> compile-main:
> >>>>>>>>>>>   [javac] Compiling 1 source file to
> >>>>>>>>>>>
> >>>>>>>
> > /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>>>>>>> src/p
> >>>>>>>>>>> atched/branch-3.2/build/classes
> >>>>>>>>>>>   [javac]
> >>>>>>>>>>>
> >>>>>>>
> > /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>>>>>>> src/p
> >>>>>>>>>>>
> >>>>>>>
> > atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>>>>>>> FastL
> >>>>>>>>>>> eaderElection.java:601: cannot find symbol
> >>>>>>>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>>>>>>   [javac] location: interface
> >>>>>>>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>>>>   [javac]
> >>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>>   [javac]
> > ^
> >>>>>>>>>>>   [javac] 1 error
> >>>>>>>>>>>
> >>>>>>>>>>> I see a reference to getWeight in both
> > FastLeaderElection.java
> >>>>>> in
> >>>>>>>>>>> patch
> >>>>>>>>>>> 491:
> >>>>>>>>>>>
> >>>>>>>>>>> patches/ZOOKEEPER-491.patch:+
> >>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>>>>>>> FastLeaderElection.java
> >>>>>>>>>>> :
> >>>>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> >>>>>>>>>>> 0)
> >>>>>>>>>>>
> >>>>>>>>>>> However, I don't see a reference to this method in patches
> > 473,
> >>>>>>> 479,
> >>>>>>>>>>> or
> >>>>>>>>>>> 481. I also don't see a reference to this method in the
> >>>> trunk...
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>>>
> >>>>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> This repro's in both branch-3.2, and
> > branch-3.2+patches(473,
> >>>>>>> 479,
> >>>>>>>>>>>> 481).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Basically, it seems like the nodes are electing
> > pd4-zook02
> >>>> to
> >>>>>>> be
> >>>>>>>>>>> the
> >>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> >>>>>> supposed
> >>>>>>> to
> >>>>>>>>>>> be
> >>>>>>>>>>>> and
> >>>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
> > and
> >>>>>> it
> >>>>>>>>>>> loops
> >>>>>>>>>>>> over and over.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>> Server config
> >>>>>>>>>>>> -------------
> >>>>>>>>>>>>
> >>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>>> weight.1=1
> >>>>>>>>>>>> weight.2=1
> >>>>>>>>>>>> weight.3=1
> >>>>>>>>>>>> weight.4=1
> >>>>>>>>>>>> weight.5=1
> >>>>>>>>>>>>
> >>>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>>> weight.6=0
> >>>>>>>>>>>> weight.7=0
> >>>>>>>>>>>> weight.8=0
> >>>>>>>>>>>> weight.9=0
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
> >>>>>> different
> >>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
> >>>> machines
> >>>>>>> in
> >>>>>>>>>>> dc1
> >>>>>>>>>>>> have voting rights, and the ability to become a leader.
> > The
> >>>>>>>>>>> machines
> >>>>>>>>>>>> in
> >>>>>>>>>>>> the pods all have a weight of zero, and are not expected
> > to
> >>>>>>>>> become
> >>>>>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Todd
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> >>>>
> >


Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
 comments inline


On 8/4/09 12:38 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Mahadev,
> 
> Some quick questions:
> 
> 1. Version
> 
> I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
> calling this 3.2.0. Should this be rev'd, and am I correct in calling
> this release 3.2.1?
Yes, the release is 3.2.1. The build.xml will be fixed as soon as we tag the
release.

> 
> 2. Build targets
> 
> The package target fails b/c the create-cppunit-configure target fails
> due to various problems w/ respect to autoconf. Are these dependencies
> documented somewhere? I'd like to have a fully building system.
> 
> create-cppunit-configure:
>      [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
>      [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
>      [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
>      [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
>      [exec]       If this token and others are legitimate, please use m4_pattern_allow.
>      [exec]       See the Autoconf documentation.
>      [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
>      [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1
> 
You need the autotools (autoconf, automake, libtool) to run this. Please read
the README for building the C client library in src/c/ for the installation
requirements.
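
A quick way to see which of these tools are missing before running the package target is sketched below. The tool list is an assumption drawn from the error output quoted above (libtoolize, autoreconf, autoconf) plus automake; the README in src/c/ remains the authoritative list. The AM_PATH_CPPUNIT macro is normally supplied by the cppunit development files, which this sketch does not check.

```python
import shutil

# Assumed tool list, based on the errors in the build log; not taken from
# the README itself.
REQUIRED_TOOLS = ["autoconf", "automake", "autoreconf", "libtoolize"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools: " + ", ".join(missing))
    else:
        print("All listed build tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.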
> 
> 3. Sync failure:
> 
> This is still failing.
> 
> svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist
> 

Yes, this hasn't been fixed yet.

Thanks
mahadev
> -Todd
> 


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Mahadev,

Some quick questions:

1. Version

I see that the CHANGES.txt calls this 3.2.1, but the build.xml is still
calling this 3.2.0. Should this be rev'd, and am I correct in calling
this release 3.2.1? 

2. Build targets

The package target fails b/c the create-cppunit-configure target fails
due to various problems w/ respect to autoconf. Are these dependencies
documented somewhere? I'd like to have a fully building system.

create-cppunit-configure:
     [exec] Can't exec "libtoolize": No such file or directory at /usr/bin/autoreconf line 188.
     [exec] Use of uninitialized value $libtoolize in pattern match (m//) at /usr/bin/autoreconf line 188.
     [exec] configure.ac:33: warning: macro `AM_PATH_CPPUNIT' not found in library
     [exec] configure.ac:33: error: possibly undefined macro: AM_PATH_CPPUNIT
     [exec]       If this token and others are legitimate, please use m4_pattern_allow.
     [exec]       See the Autoconf documentation.
     [exec] configure.ac:53: error: possibly undefined macro: AC_PROG_LIBTOOL
     [exec] autoreconf: /usr/bin/autoconf failed with exit status: 1


3. Sync failure:

This is still failing.

svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist

-Todd



RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Great news. Thank you Mahadev. I'll report our findings later today.
-Todd

> -----Original Message-----
> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> Sent: Tuesday, August 04, 2009 11:20 AM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Hi Todd,
>  I just committed 480 and 491. You can checkout the 3.2 branch now.
> 
> Thanks
> mahadev
> 
> 
> On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
> > That'd be perfect. Thanks!
> >
> >> -----Original Message-----
> >> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> >> Sent: Monday, August 03, 2009 4:24 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: Re: Unending Leader Elections in WAN deploy
> >>
> >> Hi Todd,
> >>   Most of the patches that you mention should be in the 3.2 branch by
> >> tomorrow or so. 481 and 479 are already in. 480 and 491 should be in by
> >> tomorrow. Would that suffice for you?
> >>
> >> Thanks
> >> mahadev
> >>
> >>
> >> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >>
> >>> Another problem...I've reverted to the latest versions of the patches
> >>> that are not specific to branch-3.2, and I'm getting two compilation
> >>> errors:
> >>>
> >>> build-generated:
> >>>     [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>
> >>> compile-main:
> >>>     [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
> >>>     [javac]         public String[] getQuorumPeers();
> >>>     [javac]                         ^
> >>>     [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
> >>>     [javac]         public String getServerState();
> >>>     [javac]                       ^
> >>>     [javac] 2 errors
> >>>
> >>> My build process is pretty simple:
> >>>
> >>> 1. copy the branch-3.2 source to a temp directory
> >>> (src/patched/branch-3.2)
> >>> 2. apply the ZOOKEEPER patches in my patches directory
> >>> 3. build zookeeper in the temp directory
> >>>
> >>> -Todd
> >>>> -----Original Message-----
> >>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>> Sent: Monday, August 03, 2009 4:09 PM
> >>>> To: zookeeper-user@hadoop.apache.org
> >>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>
> >>>> Flavio,
> >>>> I notice that you've updated the patches referenced for the WAN
> >>>> deployment. There appears to be an order dependency w/ respect to
> >>>> these four patches...
> >>>>
> >>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>>>
> >>>> 473 -> 479 (479 fails)
> >>>>
> >>>>
> >>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> >>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> >>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >>>> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> >>>> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >>>> Hunk #1 FAILED at 93.
> >>>> Hunk #2 FAILED at 145.
> >>>> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>>> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> >>>>
> >>>> Could you advise as to which patches I need to apply, and in what order?
> >>>>
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>> Sent: Friday, July 31, 2009 9:51 PM
> >>>>> To: zookeeper-user@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> Perfect! Thanks for the update, Todd.
> >>>>>
> >>>>> -Flavio
> >>>>>
> >>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>>>
> >>>>>> Thanks. You were right, I had a stale version of 479. Compilation
> >>>>>> succeeds and all tests pass on branch-3.2 with the latest patches
> >>>>>> 473, 479, 481, and 491.
> >>>>>>
> >>>>>> -Todd
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> It should be in 479. Perhaps you have a stale version of the patch.
> >>>>>>>
> >>>>>>> -Flavio
> >>>>>>>
> >>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>>>
> >>>>>>>> Flavio,
> >>>>>>>>
> >>>>>>>> I'm getting a compilation error for patch 491:
> >>>>>>>>
> >>>>>>>> compile-main:
> >>>>>>>>   [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
> >>>>>>>>   [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
> >>>>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>>>   [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>>>   [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>   [javac] ^
> >>>>>>>>   [javac] 1 error
> >>>>>>>>
> >>>>>>>> I see a reference to getWeight in both patch 491 and
> >>>>>>>> FastLeaderElection.java:
> >>>>>>>>
> >>>>>>>> patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>>>
> >>>>>>>> However, I don't see a reference to this method in patches 473, 479,
> >>>>>>>> or 481. I also don't see a reference to this method in the trunk...
> >>>>>>>>
> >>>>>>>> -Todd
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>>>
> >>>>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>>>> -Todd
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>>>
> >>>>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>>>
> >>>>>>>>>> -Flavio
> >>>>>>>>>>
> >>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>>>
> >>>>>>>>>>> This repro's in both branch-3.2, and
branch-3.2+patches(473,
> >>>> 479,
> >>>>>>>>>>> 481).
> >>>>>>>>>>>
> >>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02
> > to
> >>>> be
> >>>>>>>> the
> >>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> >>> supposed
> >>>> to
> >>>>>>>> be
> >>>>>>>>>>> and
> >>>>>>>>>>> then disconnects everyone. Then they re-elect it again,
and
> >>> it
> >>>>>>>> loops
> >>>>>>>>>>> over and over.
> >>>>>>>>>>>
> >>>>>>>>>>> -------------
> >>>>>>>>>>> Server config
> >>>>>>>>>>> -------------
> >>>>>>>>>>>
> >>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>>>
> >>>>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>>>> weight.1=1
> >>>>>>>>>>> weight.2=1
> >>>>>>>>>>> weight.3=1
> >>>>>>>>>>> weight.4=1
> >>>>>>>>>>> weight.5=1
> >>>>>>>>>>>
> >>>>>>>>>>> group.2:6:7:8:9
> >>>>>>>>>>> weight.6=0
> >>>>>>>>>>> weight.7=0
> >>>>>>>>>>> weight.8=0
> >>>>>>>>>>> weight.9=0
> >>>>>>>>>>>
> >>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
> >>> different
> >>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
> > machines
> >>>> in
> >>>>>>>> dc1
> >>>>>>>>>>> have voting rights, and the ability to become a leader.
The
> >>>>>>>> machines
> >>>>>>>>>>> in
> >>>>>>>>>>> the pods all have a weight of zero, and are not expected
to
> >>>>>> become
> >>>>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>>>
> >>>>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>>>
> >>>>>>>>>>> -Todd
> >>>>>>>>
> >>>>>>
> >>>
> >
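
A note on the configuration quoted above, for readers following the thread: the group.N and weight.N entries are ZooKeeper's hierarchical quorum settings. As I read the documented rule, a set of servers forms a quorum when, in a majority of the groups that have non-zero total weight, it holds more than half of that group's weight; groups whose total weight is zero are left out of the count. The Python sketch below applies that rule to the nine servers from the thread. It is an illustration only, not ZooKeeper's QuorumHierarchical code, and the function and variable names are made up for the example.

```python
# Illustrative sketch of the hierarchical quorum rule; not ZooKeeper code.
# The function and variable names are hypothetical.

# Server ids, groups and weights as given in the config from the thread.
GROUPS = {1: {1, 2, 3, 4, 5}, 2: {6, 7, 8, 9}}
WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 0, 9: 0}


def contains_quorum(acks, groups=GROUPS, weights=WEIGHTS):
    """Return True if the server ids in `acks` form a hierarchical quorum."""
    counted_groups = 0    # groups with non-zero total weight
    satisfied_groups = 0  # counted groups where acks hold > half the weight
    for members in groups.values():
        total = sum(weights[sid] for sid in members)
        if total == 0:
            continue  # zero-weight groups do not take part in quorums
        counted_groups += 1
        acked = sum(weights[sid] for sid in members if sid in acks)
        if 2 * acked > total:
            satisfied_groups += 1
    return 2 * satisfied_groups > counted_groups


if __name__ == "__main__":
    print(contains_quorum({1, 2, 3}))           # three dc1 servers -> True
    print(contains_quorum({1, 2, 6, 7, 8, 9}))  # two dc1 servers plus all pods -> False
```

Under this rule the pod servers (6 to 9) never change the outcome, which matches the stated intent that only the dc1 machines vote.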


Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd, 
 I just committed 480 and 491. You can check out the 3.2 branch now.

Thanks
mahadev


On 8/3/09 4:29 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> That'd be perfect. Thanks!
> 
>> -----Original Message-----
>> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
>> Sent: Monday, August 03, 2009 4:24 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: Re: Unending Leader Elections in WAN deploy
>> 
>> Hi Todd,
>>   Most of the patches that you mention should be in the branch 3.2 by
>> tomorrow or so. 481 and 479 are already in. 480 and 491 should be in by
>> tomorrow. Would that suffice for you?
>> 
>> Thanks
>> mahadev
>> 
>> 
>> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>> 
>>> Another problem...I've reverted to the latest versions of the patches
>>> that are not specific to branch-3.2, and I'm getting two compilation
>>> errors:
>>> 
>>> build-generated:
>>>     [javac] Compiling 44 source files to
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>> atched/branch-3.2/build/classes
>>> 
>>> compile-main:
>>>     [javac] Compiling 2 source files to
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>> atched/branch-3.2/build/classes
>>>     [javac]
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>> 
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>> mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers()
> have
>>> the same erasure
>>>     [javac]         public String[] getQuorumPeers();
>>>     [javac]                         ^
>>>     [javac]
>>> 
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
>>> 
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
>>> mStats.java:31: name clash: getServerState() and getServerState()
> have
>>> the same erasure
>>>     [javac]         public String getServerState();
>>>     [javac]                       ^
>>>     [javac] 2 errors
>>> 
>>> My build process is pretty simple:
>>> 
>>> 1. copy the branch-3.2 source to a temp directory
>>> (src/patched/branch-3.2)
>>> 2. apply the ZOOKEEPER patches in my patches directory
>>> 3. build zookeeper in the temp directory
>>> 
>>> -Todd
>>>> -----Original Message-----
>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>> Sent: Monday, August 03, 2009 4:09 PM
>>>> To: zookeeper-user@hadoop.apache.org
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>> 
>>>> Flavio,
>>>> I notice that you've updated the patches referenced for the WAN
>>>> deployment. There appears to be an order dependency w/ respect to
>>> these
>>>> four patches...
>>>> 
>>>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>>>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>>>> 
>>>> 473 -> 479 (479 fails)
>>>> 
>>>> 
>>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>> /src/patched/branch-3.2$ patch -p0 <
>>>> ../patches/ZOOKEEPER-479-branch3.2.patch
>>>> patching file
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>>>> ical.java
>>>> patching file
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>>>> patching file
>>>> 
>>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>>>> .java
>>>> patching file
>>>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>>>> Hunk #1 FAILED at 93.
>>>> Hunk #2 FAILED at 145.
>>>> 2 out of 2 hunks FAILED -- saving rejects to file
>>>> 
>>> 
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>>>> 
>>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>>>> /src/patched/branch-3.2$ h ../patches/
>>>> 
>>>> Could you advise as to which patches I need to apply, and in what
>>> order?
>>>> 
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>> Sent: Friday, July 31, 2009 9:51 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> Perfect! Thanks for the update, Todd.
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>>>> 
>>>>>> Thanks. You were right, I had a stale version of 479. Compilation
>>>>>> succeeds and all tests pass on branch-3.2 with the latest patches
>>>> 473,
>>>>>> 479, 481, and 491.
>>>>>> 
>>>>>> -Todd
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> It should be in 479. Perhaps you have a stale version of the
>>> patch.
>>>>>>> 
>>>>>>> -Flavio
>>>>>>> 
>>>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>>>> 
>>>>>>>> Flavio,
>>>>>>>> 
>>>>>>>> I'm getting a compilation error for patch 491:
>>>>>>>> 
>>>>>>>> compile-main:
>>>>>>>>   [javac] Compiling 1 source file to
>>>>>>>> 
>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>> src/p
>>>>>>>> atched/branch-3.2/build/classes
>>>>>>>>   [javac]
>>>>>>>> 
>>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>>>> src/p
>>>>>>>> 
>>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>> FastL
>>>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>>>   [javac] location: interface
>>>>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>>>   [javac]
>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>>   [javac]                                                    ^
>>>>>>>>   [javac] 1 error
>>>>>>>> 
>>>>>>>> I see a reference to getWeight in both FastLeaderElection.java
>>> in
>>>>>>>> patch
>>>>>>>> 491:
>>>>>>>> 
>>>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>>>> FastLeaderElection.java
>>>>>>>> :
>>>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>>>> 0)
>>>>>>>> 
>>>>>>>> However, I don't see a reference to this method in patches 473,
>>>> 479,
>>>>>>>> or
>>>>>>>> 481. I also don't see a reference to this method in the
> trunk...
>>>>>>>> 
>>>>>>>> -Todd
>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>>>> 
>>>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>>>> -Todd
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>>>> 
>>>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>>>> 
>>>>>>>>>> -Flavio
>>>>>>>>>> 
>>>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>>>> 
>>>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473,
>>>> 479,
>>>>>>>>>>> 481).
>>>>>>>>>>> 
>>>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02
> to
>>>> be
>>>>>>>> the
>>>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
>>> supposed
>>>> to
>>>>>>>> be
>>>>>>>>>>> and
>>>>>>>>>>> then disconnects everyone. Then they re-elect it again, and
>>> it
>>>>>>>> loops
>>>>>>>>>>> over and over.
>>>>>>>>>>> 
>>>>>>>>>>> -------------
>>>>>>>>>>> Server config
>>>>>>>>>>> -------------
>>>>>>>>>>> 
>>>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>>>> 
>>>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>>>> weight.1=1
>>>>>>>>>>> weight.2=1
>>>>>>>>>>> weight.3=1
>>>>>>>>>>> weight.4=1
>>>>>>>>>>> weight.5=1
>>>>>>>>>>> 
>>>>>>>>>>> group.2:6:7:8:9
>>>>>>>>>>> weight.6=0
>>>>>>>>>>> weight.7=0
>>>>>>>>>>> weight.8=0
>>>>>>>>>>> weight.9=0
>>>>>>>>>>> 
>>>>>>>>>>> Note that we have 2 groups, composed of machines in 3
>>> different
>>>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
> machines
>>>> in
>>>>>>>> dc1
>>>>>>>>>>> have voting rights, and the ability to become a leader. The
>>>>>>>> machines
>>>>>>>>>>> in
>>>>>>>>>>> the pods all have a weight of zero, and are not expected to
>>>>>> become
>>>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>>>> 
>>>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>>>> 
>>>>>>>>>>> -Todd
>>>>>>>> 
>>>>>> 
>>> 
> 


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
That'd be perfect. Thanks!

> -----Original Message-----
> From: Mahadev Konar [mailto:mahadev@yahoo-inc.com]
> Sent: Monday, August 03, 2009 4:24 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Hi Todd,
>   Most of the patches that you mention should be in the branch 3.2 by
> tomorrow or so. 481 and 479 are already in. 480 and 491 should be in by
> tomorrow. Would that suffice for you?
> 
> Thanks
> mahadev
> 
> 
> On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> 
> > Another problem...I've reverted to the latest versions of the patches
> > that are not specific to branch-3.2, and I'm getting two compilation
> > errors:
> >
> > build-generated:
> >     [javac] Compiling 44 source files to
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > atched/branch-3.2/build/classes
> >
> > compile-main:
> >     [javac] Compiling 2 source files to
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> > atched/branch-3.2/build/classes
> >     [javac]
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> > mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers()
have
> > the same erasure
> >     [javac]         public String[] getQuorumPeers();
> >     [javac]                         ^
> >     [javac]
> >
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> >
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> > mStats.java:31: name clash: getServerState() and getServerState()
have
> > the same erasure
> >     [javac]         public String getServerState();
> >     [javac]                       ^
> >     [javac] 2 errors
> >
> > My build process is pretty simple:
> >
> > 1. copy the branch-3.2 source to a temp directory
> > (src/patched/branch-3.2)
> > 2. apply the ZOOKEEPER patches in my patches directory
> > 3. build zookeeper in the temp directory
> >
> > -Todd
> >> -----Original Message-----
> >> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >> Sent: Monday, August 03, 2009 4:09 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Flavio,
> >> I notice that you've updated the patches referenced for the WAN
> >> deployment. There appears to be an order dependency w/ respect to
> > these
> >> four patches...
> >>
> >> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> >> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> >>
> >> 473 -> 479 (479 fails)
> >>
> >>
> >
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> >> /src/patched/branch-3.2$ patch -p0 <
> >> ../patches/ZOOKEEPER-479-branch3.2.patch
> >> patching file
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
> >> ical.java
> >> patching file
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> >> patching file
> >>
> >
src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
> >> .java
> >> patching file
> >> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> >> Hunk #1 FAILED at 93.
> >> Hunk #2 FAILED at 145.
> >> 2 out of 2 hunks FAILED -- saving rejects to file
> >>
> >
src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> >>
> >
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
> >> /src/patched/branch-3.2$ h ../patches/
> >>
> >> Could you advise as to which patches I need to apply, and in what
> > order?
> >>
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>> Sent: Friday, July 31, 2009 9:51 PM
> >>> To: zookeeper-user@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> Perfect! Thanks for the update, Todd.
> >>>
> >>> -Flavio
> >>>
> >>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >>>
> >>>> Thanks. You were right, I had a stale version of 479. Compilation
> >>>> succeeds and all tests pass on branch-3.2 with the latest patches
> >> 473,
> >>>> 479, 481, and 491.
> >>>>
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>> Sent: Friday, July 31, 2009 7:48 PM
> >>>>> To: zookeeper-user@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> It should be in 479. Perhaps you have a stale version of the
> > patch.
> >>>>>
> >>>>> -Flavio
> >>>>>
> >>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>>>>
> >>>>>> Flavio,
> >>>>>>
> >>>>>> I'm getting a compilation error for patch 491:
> >>>>>>
> >>>>>> compile-main:
> >>>>>>   [javac] Compiling 1 source file to
> >>>>>>
> >> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>> src/p
> >>>>>> atched/branch-3.2/build/classes
> >>>>>>   [javac]
> >>>>>>
> >> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>>>>> src/p
> >>>>>>
> >> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>> FastL
> >>>>>> eaderElection.java:601: cannot find symbol
> >>>>>>   [javac] symbol  : method getWeight(long)
> >>>>>>   [javac] location: interface
> >>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>>>>   [javac]
> >>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>>   [javac]                                                    ^
> >>>>>>   [javac] 1 error
> >>>>>>
> >>>>>> I see a reference to getWeight in both FastLeaderElection.java
> > in
> >>>>>> patch
> >>>>>> 491:
> >>>>>>
> >>>>>> patches/ZOOKEEPER-491.patch:+
> >>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>>>> src/java/main/org/apache/zookeeper/server/quorum/
> >>>>>> FastLeaderElection.java
> >>>>>> :
> >>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> >>>>>> 0)
> >>>>>>
> >>>>>> However, I don't see a reference to this method in patches 473,
> >> 479,
> >>>>>> or
> >>>>>> 481. I also don't see a reference to this method in the
trunk...
> >>>>>>
> >>>>>> -Todd
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>>>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>>>>
> >>>>>>> Ok, I'll apply that patch and report back.
> >>>>>>> -Todd
> >>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>>>>> To: zookeeper-user@hadoop.apache.org
> >>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>>>>
> >>>>>>>> You're missing 491 from your set of patches.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>>>>
> >>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473,
> >> 479,
> >>>>>>>>> 481).
> >>>>>>>>>
> >>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02
to
> >> be
> >>>>>> the
> >>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> > supposed
> >> to
> >>>>>> be
> >>>>>>>>> and
> >>>>>>>>> then disconnects everyone. Then they re-elect it again, and
> > it
> >>>>>> loops
> >>>>>>>>> over and over.
> >>>>>>>>>
> >>>>>>>>> -------------
> >>>>>>>>> Server config
> >>>>>>>>> -------------
> >>>>>>>>>
> >>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>>>>
> >>>>>>>>> group.1:1:2:3:4:5
> >>>>>>>>> weight.1=1
> >>>>>>>>> weight.2=1
> >>>>>>>>> weight.3=1
> >>>>>>>>> weight.4=1
> >>>>>>>>> weight.5=1
> >>>>>>>>>
> >>>>>>>>> group.2:6:7:8:9
> >>>>>>>>> weight.6=0
> >>>>>>>>> weight.7=0
> >>>>>>>>> weight.8=0
> >>>>>>>>> weight.9=0
> >>>>>>>>>
> >>>>>>>>> Note that we have 2 groups, composed of machines in 3
> > different
> >>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only
machines
> >> in
> >>>>>> dc1
> >>>>>>>>> have voting rights, and the ability to become a leader. The
> >>>>>> machines
> >>>>>>>>> in
> >>>>>>>>> the pods all have a weight of zero, and are not expected to
> >>>> become
> >>>>>>>>> leaders, or to vote on transactions.
> >>>>>>>>>
> >>>>>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>>>>
> >>>>>>>>> -Todd
> >>>>>>
> >>>>
> >


Re: Unending Leader Elections in WAN deploy

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
  Most of the patches that you mention should be in the branch 3.2 by tomorrow
or so. 481 and 479 are already in. 480 and 491 should be in by tomorrow. Would
that suffice for you?

Thanks
mahadev 


On 8/3/09 4:21 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> Another problem...I've reverted to the latest versions of the patches
> that are not specific to branch-3.2, and I'm getting two compilation
> errors:
> 
> build-generated:
>     [javac] Compiling 44 source files to
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> atched/branch-3.2/build/classes
> 
> compile-main:
>     [javac] Compiling 2 source files to
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> atched/branch-3.2/build/classes
>     [javac]
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have
> the same erasure
>     [javac]         public String[] getQuorumPeers();
>     [javac]                         ^
>     [javac]
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru
> mStats.java:31: name clash: getServerState() and getServerState() have
> the same erasure
>     [javac]         public String getServerState();
>     [javac]                       ^
>     [javac] 2 errors
> 
> My build process is pretty simple:
> 
> 1. copy the branch-3.2 source to a temp directory
> (src/patched/branch-3.2)
> 2. apply the ZOOKEEPER patches in my patches directory
> 3. build zookeeper in the temp directory
> 
> -Todd
>> -----Original Message-----
>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>> Sent: Monday, August 03, 2009 4:09 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>> 
>> Flavio,
>> I notice that you've updated the patches referenced for the WAN
>> deployment. There appears to be an order dependency w/ respect to
>> these four patches...
>> 
>> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
>> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
>> 
>> 473 -> 479 (479 fails)
>> 
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>> /src/patched/branch-3.2$ patch -p0 <
>> ../patches/ZOOKEEPER-479-branch3.2.patch
>> patching file
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch
>> ical.java
>> patching file
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
>> patching file
>> 
> src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier
>> .java
>> patching file
>> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
>> Hunk #1 FAILED at 93.
>> Hunk #2 FAILED at 145.
>> 2 out of 2 hunks FAILED -- saving rejects to file
>> 
> src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
>> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper
>> /src/patched/branch-3.2$ h ../patches/
>> 
>> Could you advise as to which patches I need to apply, and in what
>> order?
>> 
>> -Todd
>> 
>>> -----Original Message-----
>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>> Sent: Friday, July 31, 2009 9:51 PM
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>> 
>>> Perfect! Thanks for the update, Todd.
>>> 
>>> -Flavio
>>> 
>>> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
>>> 
>>>> Thanks. You were right, I had a stale version of 479. Compilation
>>>> succeeds and all tests pass on branch-3.2 with the latest patches
>> 473,
>>>> 479, 481, and 491.
>>>> 
>>>> -Todd
>>>> 
>>>>> -----Original Message-----
>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>> Sent: Friday, July 31, 2009 7:48 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>> 
>>>>> It should be in 479. Perhaps you have a stale version of the
> patch.
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>>>> 
>>>>>> Flavio,
>>>>>> 
>>>>>> I'm getting a compilation error for patch 491:
>>>>>> 
>>>>>> compile-main:
>>>>>>   [javac] Compiling 1 source file to
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>> src/p
>>>>>> atched/branch-3.2/build/classes
>>>>>>   [javac]
>>>>>> 
>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>>>>> src/p
>>>>>> 
>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>>>>> FastL
>>>>>> eaderElection.java:601: cannot find symbol
>>>>>>   [javac] symbol  : method getWeight(long)
>>>>>>   [javac] location: interface
>>>>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>>>>   [javac]
>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>>   [javac]                                                    ^
>>>>>>   [javac] 1 error
>>>>>> 
>>>>>> I see a reference to getWeight in both FastLeaderElection.java
> in
>>>>>> patch
>>>>>> 491:
>>>>>> 
>>>>>> patches/ZOOKEEPER-491.patch:+
>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>>>> src/java/main/org/apache/zookeeper/server/quorum/
>>>>>> FastLeaderElection.java
>>>>>> :
>>>>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>>>>> 0)
>>>>>> 
>>>>>> However, I don't see a reference to this method in patches 473,
>> 479,
>>>>>> or
>>>>>> 481. I also don't see a reference to this method in the trunk...
>>>>>> 
>>>>>> -Todd
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>>>> 
>>>>>>> Ok, I'll apply that patch and report back.
>>>>>>> -Todd
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>>>>> To: zookeeper-user@hadoop.apache.org
>>>>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>>>> 
>>>>>>>> You're missing 491 from your set of patches.
>>>>>>>> 
>>>>>>>> -Flavio
>>>>>>>> 
>>>>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>>>> 
>>>>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473,
>> 479,
>>>>>>>>> 481).
>>>>>>>>> 
>>>>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to
>> be
>>>>>> the
>>>>>>>>> leader. However, pd4-zook02 seems to realize it's not
> supposed
>> to
>>>>>> be
>>>>>>>>> and
>>>>>>>>> then disconnects everyone. Then they re-elect it again, and
> it
>>>>>> loops
>>>>>>>>> over and over.
>>>>>>>>> 
>>>>>>>>> -------------
>>>>>>>>> Server config
>>>>>>>>> -------------
>>>>>>>>> 
>>>>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>>>> 
>>>>>>>>> group.1:1:2:3:4:5
>>>>>>>>> weight.1=1
>>>>>>>>> weight.2=1
>>>>>>>>> weight.3=1
>>>>>>>>> weight.4=1
>>>>>>>>> weight.5=1
>>>>>>>>> 
>>>>>>>>> group.2:6:7:8:9
>>>>>>>>> weight.6=0
>>>>>>>>> weight.7=0
>>>>>>>>> weight.8=0
>>>>>>>>> weight.9=0
>>>>>>>>> 
>>>>>>>>> Note that we have 2 groups, composed of machines in 3
> different
>>>>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines
>> in
>>>>>> dc1
>>>>>>>>> have voting rights, and the ability to become a leader. The
>>>>>> machines
>>>>>>>>> in
>>>>>>>>> the pods all have a weight of zero, and are not expected to
>>>> become
>>>>>>>>> leaders, or to vote on transactions.
>>>>>>>>> 
>>>>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>>>> 
>>>>>>>>> -Todd
>>>>>> 
>>>> 
> 


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Another problem...I've reverted to the latest versions of the patches
that are not specific to branch-3.2, and I'm getting two compilation
errors:

build-generated:
    [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes

compile-main:
    [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
    [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure
    [javac]         public String[] getQuorumPeers();
    [javac]                         ^
    [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/QuorumStats.java:31: name clash: getServerState() and getServerState() have the same erasure
    [javac]         public String getServerState();
    [javac]                       ^
    [javac] 2 errors

My build process is pretty simple:

1. copy the branch-3.2 source to a temp directory
(src/patched/branch-3.2)
2. apply the ZOOKEEPER patches in my patches directory
3. build zookeeper in the temp directory

-Todd
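
The three build steps above can be scripted so that every attempt starts from a pristine copy and stops at the first patch that does not apply cleanly. A minimal sketch in Python follows. It assumes a patch binary that accepts --dry-run and -i (GNU patch does) and that the build is started with ant; the build command is not named in this message, so treat it as an assumption. The function names and the directory arguments are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Iterable, List, Sequence


def patch_commands(patch_files: Iterable) -> List[List[str]]:
    """Build the command list: for each patch, a dry run, then the real apply."""
    commands = []
    for patch_file in patch_files:
        # The dry run reports failed hunks without touching the tree.
        commands.append(["patch", "-p0", "--dry-run", "-i", str(patch_file)])
        commands.append(["patch", "-p0", "-i", str(patch_file)])
    return commands


def rebuild(pristine: Path, work: Path, patch_files: Sequence[Path],
            build_cmd: Sequence[str] = ("ant",)) -> None:
    """Copy the pristine source, apply the patches in order, then build."""
    # Step 1: start from a fresh copy so earlier attempts leave nothing behind.
    if work.exists():
        shutil.rmtree(work)
    shutil.copytree(pristine, work)

    # Step 2: apply patches in the given order; check=True raises
    # CalledProcessError at the first non-zero exit status.
    absolute = [Path(p).resolve() for p in patch_files]
    for command in patch_commands(absolute):
        subprocess.run(command, cwd=work, check=True)

    # Step 3: build in the patched copy (build command is an assumption).
    subprocess.run(list(build_cmd), cwd=work, check=True)
```

A call might look like rebuild(Path("branch-3.2"), Path("src/patched/branch-3.2"), patches), where patches lists the ZOOKEEPER-*.patch files in whatever order turns out to be correct; this message does not establish that order.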
> -----Original Message-----
> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> Sent: Monday, August 03, 2009 4:09 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: RE: Unending Leader Elections in WAN deploy
> 
> Flavio,
> I notice that you've updated the patches referenced for the WAN
> deployment. There appears to be an order dependency w/ respect to
> these four patches...
> 
> ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
> ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch
> 
> 473 -> 479 (479 fails)
> 
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
> patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
> patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
> Hunk #1 FAILED at 93.
> Hunk #2 FAILED at 145.
> 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
> toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/
> 
> Could you advise as to which patches I need to apply, and in what
> order?
> 
> -Todd
> 
> > -----Original Message-----
> > From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > Sent: Friday, July 31, 2009 9:51 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: Unending Leader Elections in WAN deploy
> >
> > Perfect! Thanks for the update, Todd.
> >
> > -Flavio
> >
> > On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> >
> > > Thanks. You were right, I had a stale version of 479. Compilation
> > > succeeds and all tests pass on branch-3.2 with the latest patches
> 473,
> > > 479, 481, and 491.
> > >
> > > -Todd
> > >
> > >> -----Original Message-----
> > >> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >> Sent: Friday, July 31, 2009 7:48 PM
> > >> To: zookeeper-user@hadoop.apache.org
> > >> Subject: Re: Unending Leader Elections in WAN deploy
> > >>
> > >> It should be in 479. Perhaps you have a stale version of the
patch.
> > >>
> > >> -Flavio
> > >>
> > >> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> > >>
> > >>> Flavio,
> > >>>
> > >>> I'm getting a compilation error for patch 491:
> > >>>
> > >>> compile-main:
> > >>>   [javac] Compiling 1 source file to
> > >>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > >>> src/p
> > >>> atched/branch-3.2/build/classes
> > >>>   [javac]
> > >>>
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > >>> src/p
> > >>>
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> > >>> FastL
> > >>> eaderElection.java:601: cannot find symbol
> > >>>   [javac] symbol  : method getWeight(long)
> > >>>   [javac] location: interface
> > >>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> > >>>   [javac]
> > >>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>>   [javac]                                                    ^
> > >>>   [javac] 1 error
> > >>>
> > >>> I see a reference to getWeight in both FastLeaderElection.java
in
> > >>> patch
> > >>> 491:
> > >>>
> > >>> patches/ZOOKEEPER-491.patch:+
> > >>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > >>> src/java/main/org/apache/zookeeper/server/quorum/
> > >>> FastLeaderElection.java
> > >>> :
> > >>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> > >>> 0)
> > >>>
> > >>> However, I don't see a reference to this method in patches 473,
> 479,
> > >>> or
> > >>> 481. I also don't see a reference to this method in the trunk...
> > >>>
> > >>> -Todd
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> > >>>> Sent: Friday, July 31, 2009 7:30 PM
> > >>>> To: zookeeper-user@hadoop.apache.org
> > >>>> Subject: RE: Unending Leader Elections in WAN deploy
> > >>>>
> > >>>> Ok, I'll apply that patch and report back.
> > >>>> -Todd
> > >>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > >>>>> Sent: Friday, July 31, 2009 7:18 PM
> > >>>>> To: zookeeper-user@hadoop.apache.org
> > >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> > >>>>>
> > >>>>> You're missing 491 from your set of patches.
> > >>>>>
> > >>>>> -Flavio
> > >>>>>
> > >>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> > >>>>>
> > >>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473,
> 479,
> > >>>>>> 481).
> > >>>>>>
> > >>>>>> Basically, it seems like the nodes are electing pd4-zook02 to
> be
> > >>> the
> > >>>>>> leader. However, pd4-zook02 seems to realize it's not
supposed
> to
> > >>> be
> > >>>>>> and
> > >>>>>> then disconnects everyone. Then they re-elect it again, and
it
> > >>> loops
> > >>>>>> over and over.
> > >>>>>>
> > >>>>>> -------------
> > >>>>>> Server config
> > >>>>>> -------------
> > >>>>>>
> > >>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> > >>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> > >>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> > >>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> > >>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> > >>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> > >>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> > >>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> > >>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> > >>>>>>
> > >>>>>> group.1:1:2:3:4:5
> > >>>>>> weight.1=1
> > >>>>>> weight.2=1
> > >>>>>> weight.3=1
> > >>>>>> weight.4=1
> > >>>>>> weight.5=1
> > >>>>>>
> > >>>>>> group.2:6:7:8:9
> > >>>>>> weight.6=0
> > >>>>>> weight.7=0
> > >>>>>> weight.8=0
> > >>>>>> weight.9=0
> > >>>>>>
> > >>>>>> Note that we have 2 groups, composed of machines in 3
different
> > >>>>>> locations (dc1, pd1, and pd4). The idea is that only machines
> in
> > >>> dc1
> > >>>>>> have voting rights, and the ability to become a leader. The
> > >>> machines
> > >>>>>> in
> > >>>>>> the pods all have a weight of zero, and are not expected to
> > > become
> > >>>>>> leaders, or to vote on transactions.
> > >>>>>>
> > >>>>>> Let me know what I can do to help resolve this issue.
> > >>>>>>
> > >>>>>> -Todd
> > >>>
> > >


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Flavio,
I notice that you've updated the patches referenced for the WAN
deployment. There appears to be an order dependency w/ respect to these
four patches...

ZOOKEEPER-473.patch  ZOOKEEPER-479-branch3.2.patch
ZOOKEEPER-481-branch3.2.patch  ZOOKEEPER-491.patch

473 -> 479 (479 fails)

toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ patch -p0 < ../patches/ZOOKEEPER-479-branch3.2.patch
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarchical.java
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java
patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier.java
patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java
Hunk #1 FAILED at 93.
Hunk #2 FAILED at 145.
2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej
toddg@TODDG01LT:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2$ h ../patches/

Could you advise as to which patches I need to apply, and in what order?

-Todd

> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Friday, July 31, 2009 9:51 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> Perfect! Thanks for the update, Todd.
> 
> -Flavio
> 
> On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:
> 
> > Thanks. You were right, I had a stale version of 479. Compilation
> > succeeds and all tests pass on branch-3.2 with the latest patches
473,
> > 479, 481, and 491.
> >
> > -Todd
> >
> >> -----Original Message-----
> >> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >> Sent: Friday, July 31, 2009 7:48 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: Re: Unending Leader Elections in WAN deploy
> >>
> >> It should be in 479. Perhaps you have a stale version of the patch.
> >>
> >> -Flavio
> >>
> >> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> >>
> >>> Flavio,
> >>>
> >>> I'm getting a compilation error for patch 491:
> >>>
> >>> compile-main:
> >>>   [javac] Compiling 1 source file to
> >>>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>> src/p
> >>> atched/branch-3.2/build/classes
> >>>   [javac]
> >>>
/home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> >>> src/p
> >>>
atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> >>> FastL
> >>> eaderElection.java:601: cannot find symbol
> >>>   [javac] symbol  : method getWeight(long)
> >>>   [javac] location: interface
> >>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >>>   [javac]
> >>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>>   [javac]                                                    ^
> >>>   [javac] 1 error
> >>>
> >>> I see a reference to getWeight in both FastLeaderElection.java in
> >>> patch
> >>> 491:
> >>>
> >>> patches/ZOOKEEPER-491.patch:+
> >>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >>> src/java/main/org/apache/zookeeper/server/quorum/
> >>> FastLeaderElection.java
> >>> :
> >>> if(self.getQuorumVerifier().getWeight(n.sid) !=
> >>> 0)
> >>>
> >>> However, I don't see a reference to this method in patches 473,
479,
> >>> or
> >>> 481. I also don't see a reference to this method in the trunk...
> >>>
> >>> -Todd
> >>>
> >>>> -----Original Message-----
> >>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >>>> Sent: Friday, July 31, 2009 7:30 PM
> >>>> To: zookeeper-user@hadoop.apache.org
> >>>> Subject: RE: Unending Leader Elections in WAN deploy
> >>>>
> >>>> Ok, I'll apply that patch and report back.
> >>>> -Todd
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>>>> Sent: Friday, July 31, 2009 7:18 PM
> >>>>> To: zookeeper-user@hadoop.apache.org
> >>>>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>>>
> >>>>> You're missing 491 from your set of patches.
> >>>>>
> >>>>> -Flavio
> >>>>>
> >>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>>>
> >>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473,
479,
> >>>>>> 481).
> >>>>>>
> >>>>>> Basically, it seems like the nodes are electing pd4-zook02 to
be
> >>> the
> >>>>>> leader. However, pd4-zook02 seems to realize it's not supposed
to
> >>> be
> >>>>>> and
> >>>>>> then disconnects everyone. Then they re-elect it again, and it
> >>> loops
> >>>>>> over and over.
> >>>>>>
> >>>>>> -------------
> >>>>>> Server config
> >>>>>> -------------
> >>>>>>
> >>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>>>
> >>>>>> group.1:1:2:3:4:5
> >>>>>> weight.1=1
> >>>>>> weight.2=1
> >>>>>> weight.3=1
> >>>>>> weight.4=1
> >>>>>> weight.5=1
> >>>>>>
> >>>>>> group.2:6:7:8:9
> >>>>>> weight.6=0
> >>>>>> weight.7=0
> >>>>>> weight.8=0
> >>>>>> weight.9=0
> >>>>>>
> >>>>>> Note that we have 2 groups, composed of machines in 3 different
> >>>>>> locations (dc1, pd1, and pd4). The idea is that only machines
in
> >>> dc1
> >>>>>> have voting rights, and the ability to become a leader. The
> >>> machines
> >>>>>> in
> >>>>>> the pods all have a weight of zero, and are not expected to
> > become
> >>>>>> leaders, or to vote on transactions.
> >>>>>>
> >>>>>> Let me know what I can do to help resolve this issue.
> >>>>>>
> >>>>>> -Todd
> >>>
> >


Re: Unending Leader Elections in WAN deploy

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Perfect! Thanks for the update, Todd.

-Flavio

On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote:

> Thanks. You were right, I had a stale version of 479. Compilation
> succeeds and all tests pass on branch-3.2 with the latest patches 473,
> 479, 481, and 491.
>
> -Todd
>
>> -----Original Message-----
>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>> Sent: Friday, July 31, 2009 7:48 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: Re: Unending Leader Elections in WAN deploy
>>
>> It should be in 479. Perhaps you have a stale version of the patch.
>>
>> -Flavio
>>
>> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
>>
>>> Flavio,
>>>
>>> I'm getting a compilation error for patch 491:
>>>
>>> compile-main:
>>>   [javac] Compiling 1 source file to
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>> src/p
>>> atched/branch-3.2/build/classes
>>>   [javac]
>>> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
>>> src/p
>>> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
>>> FastL
>>> eaderElection.java:601: cannot find symbol
>>>   [javac] symbol  : method getWeight(long)
>>>   [javac] location: interface
>>> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>>>   [javac]
>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>>   [javac]                                                    ^
>>>   [javac] 1 error
>>>
>>> I see a reference to getWeight in both FastLeaderElection.java in
>>> patch
>>> 491:
>>>
>>> patches/ZOOKEEPER-491.patch:+
>>> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>>> src/java/main/org/apache/zookeeper/server/quorum/
>>> FastLeaderElection.java
>>> :
>>> if(self.getQuorumVerifier().getWeight(n.sid) !=
>>> 0)
>>>
>>> However, I don't see a reference to this method in patches 473, 479,
>>> or
>>> 481. I also don't see a reference to this method in the trunk...
>>>
>>> -Todd
>>>
>>>> -----Original Message-----
>>>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>>>> Sent: Friday, July 31, 2009 7:30 PM
>>>> To: zookeeper-user@hadoop.apache.org
>>>> Subject: RE: Unending Leader Elections in WAN deploy
>>>>
>>>> Ok, I'll apply that patch and report back.
>>>> -Todd
>>>>
>>>>> -----Original Message-----
>>>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>>>> Sent: Friday, July 31, 2009 7:18 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>>>
>>>>> You're missing 491 from your set of patches.
>>>>>
>>>>> -Flavio
>>>>>
>>>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>>>
>>>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>>>> 481).
>>>>>>
>>>>>> Basically, it seems like the nodes are electing pd4-zook02 to be
>>> the
>>>>>> leader. However, pd4-zook02 seems to realize it's not supposed to
>>> be
>>>>>> and
>>>>>> then disconnects everyone. Then they re-elect it again, and it
>>> loops
>>>>>> over and over.
>>>>>>
>>>>>> -------------
>>>>>> Server config
>>>>>> -------------
>>>>>>
>>>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>>>
>>>>>> group.1:1:2:3:4:5
>>>>>> weight.1=1
>>>>>> weight.2=1
>>>>>> weight.3=1
>>>>>> weight.4=1
>>>>>> weight.5=1
>>>>>>
>>>>>> group.2:6:7:8:9
>>>>>> weight.6=0
>>>>>> weight.7=0
>>>>>> weight.8=0
>>>>>> weight.9=0
>>>>>>
>>>>>> Note that we have 2 groups, composed of machines in 3 different
>>>>>> locations (dc1, pd1, and pd4). The idea is that only machines in
>>> dc1
>>>>>> have voting rights, and the ability to become a leader. The
>>> machines
>>>>>> in
>>>>>> the pods all have a weight of zero, and are not expected to
> become
>>>>>> leaders, or to vote on transactions.
>>>>>>
>>>>>> Let me know what I can do to help resolve this issue.
>>>>>>
>>>>>> -Todd
>>>
>


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Thanks. You were right, I had a stale version of 479. Compilation
succeeds and all tests pass on branch-3.2 with the latest patches 473,
479, 481, and 491.

-Todd
 
> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Friday, July 31, 2009 7:48 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> It should be in 479. Perhaps you have a stale version of the patch.
> 
> -Flavio
> 
> On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:
> 
> > Flavio,
> >
> > I'm getting a compilation error for patch 491:
> >
> > compile-main:
> >    [javac] Compiling 1 source file to
> > /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > src/p
> > atched/branch-3.2/build/classes
> >    [javac]
> > /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/
> > src/p
> > atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/
> > FastL
> > eaderElection.java:601: cannot find symbol
> >    [javac] symbol  : method getWeight(long)
> >    [javac] location: interface
> > org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
> >    [javac]
> > if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> >    [javac]                                                    ^
> >    [javac] 1 error
> >
> > I see a reference to getWeight in both FastLeaderElection.java in
> > patch
> > 491:
> >
> > patches/ZOOKEEPER-491.patch:+
> > if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> > src/java/main/org/apache/zookeeper/server/quorum/
> > FastLeaderElection.java
> > :
> > if(self.getQuorumVerifier().getWeight(n.sid) !=
> > 0)
> >
> > However, I don't see a reference to this method in patches 473, 479,
> > or
> > 481. I also don't see a reference to this method in the trunk...
> >
> > -Todd
> >
> >> -----Original Message-----
> >> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> >> Sent: Friday, July 31, 2009 7:30 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: RE: Unending Leader Elections in WAN deploy
> >>
> >> Ok, I'll apply that patch and report back.
> >> -Todd
> >>
> >>> -----Original Message-----
> >>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> >>> Sent: Friday, July 31, 2009 7:18 PM
> >>> To: zookeeper-user@hadoop.apache.org
> >>> Subject: Re: Unending Leader Elections in WAN deploy
> >>>
> >>> You're missing 491 from your set of patches.
> >>>
> >>> -Flavio
> >>>
> >>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >>>
> >>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> >>>> 481).
> >>>>
> >>>> Basically, it seems like the nodes are electing pd4-zook02 to be
> > the
> >>>> leader. However, pd4-zook02 seems to realize it's not supposed to
> > be
> >>>> and
> >>>> then disconnects everyone. Then they re-elect it again, and it
> > loops
> >>>> over and over.
> >>>>
> >>>> -------------
> >>>> Server config
> >>>> -------------
> >>>>
> >>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> >>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> >>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> >>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> >>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> >>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> >>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> >>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> >>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >>>>
> >>>> group.1:1:2:3:4:5
> >>>> weight.1=1
> >>>> weight.2=1
> >>>> weight.3=1
> >>>> weight.4=1
> >>>> weight.5=1
> >>>>
> >>>> group.2:6:7:8:9
> >>>> weight.6=0
> >>>> weight.7=0
> >>>> weight.8=0
> >>>> weight.9=0
> >>>>
> >>>> Note that we have 2 groups, composed of machines in 3 different
> >>>> locations (dc1, pd1, and pd4). The idea is that only machines in
> > dc1
> >>>> have voting rights, and the ability to become a leader. The
> > machines
> >>>> in
> >>>> the pods all have a weight of zero, and are not expected to
become
> >>>> leaders, or to vote on transactions.
> >>>>
> >>>> Let me know what I can do to help resolve this issue.
> >>>>
> >>>> -Todd
> >


Re: Unending Leader Elections in WAN deploy

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
It should be in 479. Perhaps you have a stale version of the patch.

-Flavio

On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:

> Flavio,
>
> I'm getting a compilation error for patch 491:
>
> compile-main:
>    [javac] Compiling 1 source file to
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ 
> src/p
> atched/branch-3.2/build/classes
>    [javac]
> /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ 
> src/p
> atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/ 
> FastL
> eaderElection.java:601: cannot find symbol
>    [javac] symbol  : method getWeight(long)
>    [javac] location: interface
> org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
>    [javac]
> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
>    [javac]                                                    ^
>    [javac] 1 error
>
> I see a reference to getWeight in both FastLeaderElection.java in  
> patch
> 491:
>
> patches/ZOOKEEPER-491.patch:+
> if(self.getQuorumVerifier().getWeight(n.sid) != 0)
> src/java/main/org/apache/zookeeper/server/quorum/ 
> FastLeaderElection.java
> :                         
> if(self.getQuorumVerifier().getWeight(n.sid) !=
> 0)
>
> However, I don't see a reference to this method in patches 473, 479,  
> or
> 481. I also don't see a reference to this method in the trunk...
>
> -Todd
>
>> -----Original Message-----
>> From: Todd Greenwood [mailto:toddg@audiencescience.com]
>> Sent: Friday, July 31, 2009 7:30 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: RE: Unending Leader Elections in WAN deploy
>>
>> Ok, I'll apply that patch and report back.
>> -Todd
>>
>>> -----Original Message-----
>>> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
>>> Sent: Friday, July 31, 2009 7:18 PM
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: Unending Leader Elections in WAN deploy
>>>
>>> You're missing 491 from your set of patches.
>>>
>>> -Flavio
>>>
>>> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
>>>
>>>> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
>>>> 481).
>>>>
>>>> Basically, it seems like the nodes are electing pd4-zook02 to be
> the
>>>> leader. However, pd4-zook02 seems to realize it's not supposed to
> be
>>>> and
>>>> then disconnects everyone. Then they re-elect it again, and it
> loops
>>>> over and over.
>>>>
>>>> -------------
>>>> Server config
>>>> -------------
>>>>
>>>> server.1=dc1-zook01.dc01.revsci.net:2888:3888
>>>> server.2=dc1-zook02.dc01.revsci.net:2888:3888
>>>> server.3=dc1-zook03.dc01.revsci.net:2888:3888
>>>> server.4=dc1-zook04.dc01.revsci.net:2888:3888
>>>> server.5=dc1-zook05.dc01.revsci.net:2888:3888
>>>> server.6=pd1-zook01.pd01.revsci.net:2888:3888
>>>> server.7=pd1-zook02.pd01.revsci.net:2888:3888
>>>> server.8=pd4-zook01.iad1.audsci.net:2888:3888
>>>> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>>>>
>>>> group.1:1:2:3:4:5
>>>> weight.1=1
>>>> weight.2=1
>>>> weight.3=1
>>>> weight.4=1
>>>> weight.5=1
>>>>
>>>> group.2:6:7:8:9
>>>> weight.6=0
>>>> weight.7=0
>>>> weight.8=0
>>>> weight.9=0
>>>>
>>>> Note that we have 2 groups, composed of machines in 3 different
>>>> locations (dc1, pd1, and pd4). The idea is that only machines in
> dc1
>>>> have voting rights, and the ability to become a leader. The
> machines
>>>> in
>>>> the pods all have a weight of zero, and are not expected to become
>>>> leaders, or to vote on transactions.
>>>>
>>>> Let me know what I can do to help resolve this issue.
>>>>
>>>> -Todd
>


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Flavio,

I'm getting a compilation error for patch 491:

compile-main:
    [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
    [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
    [javac] symbol  : method getWeight(long)
    [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
    [javac]                         if(self.getQuorumVerifier().getWeight(n.sid) != 0)
    [javac]                                                    ^
    [javac] 1 error

I see a reference to getWeight both in patch 491 and in the patched
FastLeaderElection.java:

patches/ZOOKEEPER-491.patch:+                        if(self.getQuorumVerifier().getWeight(n.sid) != 0)
src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:                        if(self.getQuorumVerifier().getWeight(n.sid) != 0)

However, I don't see a reference to this method in patches 473, 479, or
481. I also don't see a reference to this method in the trunk...
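
For readers following along, the failing line only compiles if the
verifier interface on the classpath declares getWeight(long). The guard
itself appears to skip election messages from servers whose weight is
zero. Below is a minimal sketch of that idea, not the actual ZooKeeper
source: WeightLookup and VoteFilter are hypothetical names introduced
here for illustration, and only the getWeight(long) signature is taken
from the compiler output above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the verifier interface; only the
// getWeight(long) signature comes from the javac output in this thread.
interface WeightLookup {
    long getWeight(long sid);
}

// Hypothetical helper, not part of ZooKeeper.
final class VoteFilter {
    /** Keeps only the sender ids whose configured weight is non-zero. */
    static List<Long> votingSenders(List<Long> senderIds, WeightLookup verifier) {
        List<Long> kept = new ArrayList<>();
        for (long sid : senderIds) {
            // Same shape as the guard in the patched FastLeaderElection line.
            if (verifier.getWeight(sid) != 0) {
                kept.add(sid);
            }
        }
        return kept;
    }
}
```

If the interface lacks the method (for example because an older copy of
the patch that introduces it was applied), the call site fails with
exactly the "cannot find symbol" error shown above.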

-Todd

> -----Original Message-----
> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> Sent: Friday, July 31, 2009 7:30 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: RE: Unending Leader Elections in WAN deploy
> 
> Ok, I'll apply that patch and report back.
> -Todd
> 
> > -----Original Message-----
> > From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> > Sent: Friday, July 31, 2009 7:18 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: Unending Leader Elections in WAN deploy
> >
> > You're missing 491 from your set of patches.
> >
> > -Flavio
> >
> > On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> >
> > > This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> > > 481).
> > >
> > > Basically, it seems like the nodes are electing pd4-zook02 to be
the
> > > leader. However, pd4-zook02 seems to realize it's not supposed to
be
> > > and
> > > then disconnects everyone. Then they re-elect it again, and it
loops
> > > over and over.
> > >
> > > -------------
> > > Server config
> > > -------------
> > >
> > > server.1=dc1-zook01.dc01.revsci.net:2888:3888
> > > server.2=dc1-zook02.dc01.revsci.net:2888:3888
> > > server.3=dc1-zook03.dc01.revsci.net:2888:3888
> > > server.4=dc1-zook04.dc01.revsci.net:2888:3888
> > > server.5=dc1-zook05.dc01.revsci.net:2888:3888
> > > server.6=pd1-zook01.pd01.revsci.net:2888:3888
> > > server.7=pd1-zook02.pd01.revsci.net:2888:3888
> > > server.8=pd4-zook01.iad1.audsci.net:2888:3888
> > > server.9=pd4-zook02.iad1.audsci.net:2888:3888
> > >
> > > group.1:1:2:3:4:5
> > > weight.1=1
> > > weight.2=1
> > > weight.3=1
> > > weight.4=1
> > > weight.5=1
> > >
> > > group.2:6:7:8:9
> > > weight.6=0
> > > weight.7=0
> > > weight.8=0
> > > weight.9=0
> > >
> > > Note that we have 2 groups, composed of machines in 3 different
> > > locations (dc1, pd1, and pd4). The idea is that only machines in
dc1
> > > have voting rights, and the ability to become a leader. The
machines
> > > in
> > > the pods all have a weight of zero, and are not expected to become
> > > leaders, or to vote on transactions.
> > >
> > > Let me know what I can do to help resolve this issue.
> > >
> > > -Todd


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Ok, I'll apply that patch and report back.
-Todd

> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Friday, July 31, 2009 7:18 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Unending Leader Elections in WAN deploy
> 
> You're missing 491 from your set of patches.
> 
> -Flavio
> 
> On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:
> 
> > This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
> > 481).
> >
> > Basically, it seems like the nodes are electing pd4-zook02 to be the
> > leader. However, pd4-zook02 seems to realize it's not supposed to be
> > and
> > then disconnects everyone. Then they re-elect it again, and it loops
> > over and over.
> >
> > -------------
> > Server config
> > -------------
> >
> > server.1=dc1-zook01.dc01.revsci.net:2888:3888
> > server.2=dc1-zook02.dc01.revsci.net:2888:3888
> > server.3=dc1-zook03.dc01.revsci.net:2888:3888
> > server.4=dc1-zook04.dc01.revsci.net:2888:3888
> > server.5=dc1-zook05.dc01.revsci.net:2888:3888
> > server.6=pd1-zook01.pd01.revsci.net:2888:3888
> > server.7=pd1-zook02.pd01.revsci.net:2888:3888
> > server.8=pd4-zook01.iad1.audsci.net:2888:3888
> > server.9=pd4-zook02.iad1.audsci.net:2888:3888
> >
> > group.1:1:2:3:4:5
> > weight.1=1
> > weight.2=1
> > weight.3=1
> > weight.4=1
> > weight.5=1
> >
> > group.2:6:7:8:9
> > weight.6=0
> > weight.7=0
> > weight.8=0
> > weight.9=0
> >
> > Note that we have 2 groups, composed of machines in 3 different
> > locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> > have voting rights, and the ability to become a leader. The machines
> > in
> > the pods all have a weight of zero, and are not expected to become
> > leaders, or to vote on transactions.
> >
> > Let me know what I can do to help resolve this issue.
> >
> > -Todd


Re: Unending Leader Elections in WAN deploy

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
You're missing 491 from your set of patches.

-Flavio

On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote:

> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,  
> 481).
>
> Basically, it seems like the nodes are electing pd4-zook02 to be the
> leader. However, pd4-zook02 seems to realize it's not supposed to be  
> and
> then disconnects everyone. Then they re-elect it again, and it loops
> over and over.
>
> -------------
> Server config
> -------------
>
> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> server.9=pd4-zook02.iad1.audsci.net:2888:3888
>
> group.1:1:2:3:4:5
> weight.1=1
> weight.2=1
> weight.3=1
> weight.4=1
> weight.5=1
>
> group.2:6:7:8:9
> weight.6=0
> weight.7=0
> weight.8=0
> weight.9=0
>
> Note that we have 2 groups, composed of machines in 3 different
> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> have voting rights, and the ability to become a leader. The machines  
> in
> the pods all have a weight of zero, and are not expected to become
> leaders, or to vote on transactions.
>
> Let me know what I can do to help resolve this issue.
>
> -Todd


RE: Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
Somehow the logs did not attach. The ZooKeeper logs should be attached this time.

> -----Original Message-----
> From: Todd Greenwood [mailto:toddg@audiencescience.com]
> Sent: Friday, July 31, 2009 7:15 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Unending Leader Elections in WAN deploy
> 
> This repro's in both branch-3.2, and branch-3.2+patches(473, 479,
481).
> 
> Basically, it seems like the nodes are electing pd4-zook02 to be the
> leader. However, pd4-zook02 seems to realize it's not supposed to be
and
> then disconnects everyone. Then they re-elect it again, and it loops
> over and over.
> 
> -------------
> Server config
> -------------
> 
> server.1=dc1-zook01.dc01.revsci.net:2888:3888
> server.2=dc1-zook02.dc01.revsci.net:2888:3888
> server.3=dc1-zook03.dc01.revsci.net:2888:3888
> server.4=dc1-zook04.dc01.revsci.net:2888:3888
> server.5=dc1-zook05.dc01.revsci.net:2888:3888
> server.6=pd1-zook01.pd01.revsci.net:2888:3888
> server.7=pd1-zook02.pd01.revsci.net:2888:3888
> server.8=pd4-zook01.iad1.audsci.net:2888:3888
> server.9=pd4-zook02.iad1.audsci.net:2888:3888
> 
> group.1:1:2:3:4:5
> weight.1=1
> weight.2=1
> weight.3=1
> weight.4=1
> weight.5=1
> 
> group.2:6:7:8:9
> weight.6=0
> weight.7=0
> weight.8=0
> weight.9=0
> 
> Note that we have 2 groups, composed of machines in 3 different
> locations (dc1, pd1, and pd4). The idea is that only machines in dc1
> have voting rights, and the ability to become a leader. The machines
in
> the pods all have a weight of zero, and are not expected to become
> leaders, or to vote on transactions.
> 
> Let me know what I can do to help resolve this issue.
> 
> -Todd

Unending Leader Elections in WAN deploy

Posted by Todd Greenwood <to...@audiencescience.com>.
This reproduces in both branch-3.2 and branch-3.2 plus patches (473, 479, 481).

Basically, it seems like the nodes are electing pd4-zook02 to be the
leader. However, pd4-zook02 seems to realize it's not supposed to be the
leader and then disconnects everyone. Then they re-elect it, and this
loops over and over.

-------------
Server config
-------------

server.1=dc1-zook01.dc01.revsci.net:2888:3888
server.2=dc1-zook02.dc01.revsci.net:2888:3888
server.3=dc1-zook03.dc01.revsci.net:2888:3888
server.4=dc1-zook04.dc01.revsci.net:2888:3888
server.5=dc1-zook05.dc01.revsci.net:2888:3888
server.6=pd1-zook01.pd01.revsci.net:2888:3888
server.7=pd1-zook02.pd01.revsci.net:2888:3888
server.8=pd4-zook01.iad1.audsci.net:2888:3888
server.9=pd4-zook02.iad1.audsci.net:2888:3888

group.1:1:2:3:4:5
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1

group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0

Note that we have 2 groups, composed of machines in 3 different
locations (dc1, pd1, and pd4). The idea is that only machines in dc1
have voting rights, and the ability to become a leader. The machines in
the pods all have a weight of zero, and are not expected to become
leaders, or to vote on transactions.
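
For reference, the intent behind this config can be sketched as a small quorum check. This is a minimal Python sketch, not ZooKeeper's actual implementation. It assumes that a quorum needs a strict majority of the groups, each with more than half of its weight acknowledged, and that groups whose total weight is zero are ignored. The names parse_config and has_quorum are hypothetical.

```python
import re

# Group and weight lines copied from the server config above.
CONFIG = """\
group.1:1:2:3:4:5
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1
group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0
"""

def parse_config(text):
    """Return (groups, weights) parsed from group.N / weight.N lines."""
    groups, weights = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Java-style properties accept either '=' or ':' after the key.
        key, value = re.split(r"[=:]", line, maxsplit=1)
        kind, ident = key.split(".", 1)
        if kind == "group":
            groups[int(ident)] = [int(sid) for sid in value.split(":")]
        elif kind == "weight":
            weights[int(ident)] = int(value)
    return groups, weights

def group_weight(members, weights, acked=None):
    """Sum member weights (assumed default 1), optionally only acked members."""
    return sum(weights.get(s, 1) for s in members if acked is None or s in acked)

def has_quorum(acked, groups, weights):
    """Assumed rule: a majority of non-zero-weight groups must each have
    more than half of their total weight among the acknowledging servers."""
    voting = [m for m in groups.values() if group_weight(m, weights) > 0]
    satisfied = sum(
        1 for m in voting
        if group_weight(m, weights, acked) > group_weight(m, weights) / 2
    )
    return satisfied > len(voting) / 2

groups, weights = parse_config(CONFIG)
print(has_quorum({1, 2, 3}, groups, weights))           # True: 3 of 5 dc1 votes
print(has_quorum({1, 2, 6, 7, 8, 9}, groups, weights))  # False: pods carry no weight
```

Under these assumptions the pod servers (6 through 9) never contribute to a quorum, which is the behavior the weights above are meant to express.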

Let me know what I can do to help resolve this issue.

-Todd

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Todd Greenwood wrote:
> On a plus note, I'm finding that this morning, @work rather than @home,
> the tests continue to completion. However, there are other issues that
> I'll bring up on the dev list, such as a requirement to have autoconf
> installed, and problems in the create-cppunit-configure task that can't
> exec libtoolize, fun stuff like that.

Great, good to hear. At some point figuring out what's up with your 
@home would be interesting to us. :-)

Yes, there are some basic requirements such as autotools, cppunit, etc.,
but please do raise all this on the dev list.

> I need to proceed with the manual patches to branch-3.2, as I am under
> some time constraints to get our infrastructure deployed such that QA
> can start playing with it. However, I'll switch to 3.2.1 as soon as I
> can.

Understood.

Patrick

>> -----Original Message-----
>> From: Patrick Hunt [mailto:phunt@apache.org]
>> Sent: Friday, July 31, 2009 11:38 AM
>> To: zookeeper-user@hadoop.apache.org; Todd Greenwood
>> Subject: Re: test failures in branch-3.2
>>
>> Hi Todd,
>>
>> Sorry for the clutter/confusion. Usually things aren't this cumbersome ;-)
>> In particular:
>>    1 committer is on vacation
>>    Mahadev's been out sick for multiple days
>>    I'm sick but trying to hang in there, but def not 100%
>>
>> Hudson (CI) has been offline for effectively the past 3 weeks (that
>> gates all our commits) and is just now back but flaky.
>>
>> 3.2 had some bugs that we are trying to address, but the aforementioned
>> issues are slowing us down. Otherwise we'd have all this straightened
>> out by now ....
>>
>> At this point you should move this discussion to the dev list - Apache
>> doesn't really like us to discuss code changes/futures here (user list).
>> On that list you'll also see the plan for upcoming releases - I mention
>> all this because we are actively working toward 3.2.1 which will include
>> the JIRAs slated for that release (I'm sure you've seen).
>>
>> If you can wait a bit you might be able to avoid some pain by using the
>> upcoming 3.2.1 release. Once the patches land into that branch your
>> issues will be resolved w/o you needing to manually apply patches, etc...
>>
>> I did look at the files you attached - they look fine, so I'm not sure
>> what the issue is. The form of this test makes it harder - we are
>> verifying that the log contains sufficient information when a particular
>> error occurs. We fiddle with log4j in order to do this, which means that
>> the log you are including doesn't specify the problem.
>>
>> Try instrumenting this test with a try/catch around the content of the
>> test method (all the code in the failing method inside a big try/catch
>> is what I mean). Then print the error to std out as part of the catch.
>> That should shed some light. If you could debug it a bit that would help
>> - because we aren't seeing this in our environment.
>>
>> Again, sort of a moot point if you can wait a week or so...
>>
>> Regards,
>>
>> Patrick
>>
>> Todd Greenwood wrote:
>>> Inline.
>>>
>>>> -----Original Message-----
>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>> Sent: Thursday, July 30, 2009 10:57 PM
>>>> To: zookeeper-user@hadoop.apache.org
>>>> Subject: Re: test failures in branch-3.2
>>>>
>>>> Todd Greenwood wrote:
>>>>> Starting w/ branch-3.2 (no changes) I applied patches in this order:
>>>>>
>>>>> 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
>>>>> 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
>>>>> PortAssignment.java.
>>>>>
>>>>> PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
>>>>> which is a pretty hefty patch (> 2k lines) and touches a large number of
>>>>> files.
>>>> Hrm, those patches were probably created against the trunk. We'll have
>>>> to have separate patches for trunk and 3.2 branch on 481.
>>>>
>>>> If you could update the jira with this detail (481 needs two patches,
>>>> one for each branch) that would be great!
>>>>
>>> Done.
>>>
>>>>> 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
>>>>> crashes).
>>>> 473 is "special" (unique) in the sense that it changes log4j while the
>>>> vm is running. In general though it's a pretty boring test and
>>>> shouldn't be failing.
>>>>
>>>> Are you sure you have the right patch file? There are 2 patch files on
>>>> the JIRA for 473; make sure that you have the one from 7/16, NOT the one
>>>> from 7/15. Check the patch file: the correct one should NOT contain
>>>> changes to build.xml or conf/log4j* files. If this still happens send me
>>>> your build.xml, conf/log4j* and QuorumPeerMainTest.java files in email
>>>> for review. I'll take a look.
>>>>
>>>
>>> I've annotated the files w/ their date while downloading:
>>> 112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
>>> 110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch
>>>
>>> It appears I applied the 7-16 patch, as that is the matching file size
>>> of the patch file I applied.
>>>
>>> If there are to be multiple patch files for multiple branches (3.2,
>>> trunk, etc.) would it make sense to label the patch files accordingly?
>>>
>>> Requested files in attached tar.
>>>
>>> -Todd
>>>
>>>> Patrick
>>>>
>>>>
>>>>> [junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>>>     [junit] Running
>>>>> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>>>     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
>>>>>     [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>>> FAILED (crashed)
>>>>>
>>>>> ------------
>>>>> Test Log
>>>>> ------------
>>>>> Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>>> Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
>>>>>
>>>>> Testcase: testBadPeerAddressInQuorum took 0.004 sec
>>>>>     Caused an ERROR
>>>>> Forked Java VM exited abnormally. Please note the time in the report
>>>>> does not reflect the time until the VM exit.
>>>>> junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
>>>>> Please note the time in the report does not reflect the time until the
>>>>> VM exit.
>>>>>
>>>>> -Todd
>>>>>
>>>>> -----Original Message-----
>>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>>> Sent: Thursday, July 30, 2009 10:13 PM
>>>>> To: zookeeper-user@hadoop.apache.org
>>>>> Subject: Re: test failures in branch-3.2
>>>>>
>>>>> Todd Greenwood wrote:
>>>>>> ....
>>>>>> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
>>>>>> assumed it was a timing issue w/ respect to test A not fully releasing
>>>>>> resources before test B started.
>>>>> Might be, but actually I think it's related to this:
>>>>> http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
>>>>>
>>>>> Patrick

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick,
Thank you for the background (and I hope you and Mahadev recover
quickly).

On a plus note, I'm finding that this morning, @work rather than @home,
the tests continue to completion. However, there are other issues that
I'll bring up on the dev list, such as a requirement to have autoconf
installed, and problems in the create-cppunit-configure task that can't
exec libtoolize, fun stuff like that.

I need to proceed with the manual patches to branch-3.2, as I am under
some time constraints to get our infrastructure deployed such that QA
can start playing with it. However, I'll switch to 3.2.1 as soon as I
can.

-Todd

> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Friday, July 31, 2009 11:38 AM
> To: zookeeper-user@hadoop.apache.org; Todd Greenwood
> Subject: Re: test failures in branch-3.2
> 
> Hi Todd,
> 
> Sorry for the clutter/confusion. Usually things aren't this cumbersome ;-)
> 
> In particular:
>    1 committer is on vacation
>    Mahadev's been out sick for multiple days
>    I'm sick but trying to hang in there, but def not 100%
> 
> Hudson (CI) has been offline for effectively the past 3 weeks (that
> gates all our commits) and is just now back but flaky.
> 
> 3.2 had some bugs that we are trying to address, but the aforementioned
> issues are slowing us down. Otherwise we'd have all this straightened
> out by now ....
> 
> At this point you should move this discussion to the dev list - Apache
> doesn't really like us to discuss code changes/futures here (user list).
> On that list you'll also see the plan for upcoming releases - I mention
> all this because we are actively working toward 3.2.1 which will include
> the JIRAs slated for that release (I'm sure you've seen).
> 
> If you can wait a bit you might be able to avoid some pain by using the
> upcoming 3.2.1 release. Once the patches land into that branch your
> issues will be resolved w/o you needing to manually apply patches, etc...
> 
> I did look at the files you attached - they look fine, so I'm not sure
> what the issue is. The form of this test makes it harder - we are
> verifying that the log contains sufficient information when a particular
> error occurs. We fiddle with log4j in order to do this, which means that
> the log you are including doesn't specify the problem.
> 
> Try instrumenting this test with a try/catch around the content of the
> test method (all the code in the failing method inside a big try/catch
> is what I mean). Then print the error to std out as part of the catch.
> That should shed some light. If you could debug it a bit that would help
> - because we aren't seeing this in our environment.
> 
> Again, sort of a moot point if you can wait a week or so...
> 
> Regards,
> 
> Patrick
> 
> Todd Greenwood wrote:
> > Inline.
> >
> >> -----Original Message-----
> >> From: Patrick Hunt [mailto:phunt@apache.org]
> >> Sent: Thursday, July 30, 2009 10:57 PM
> >> To: zookeeper-user@hadoop.apache.org
> >> Subject: Re: test failures in branch-3.2
> >>
> >> Todd Greenwood wrote:
> >>> Starting w/ branch-3.2 (no changes) I applied patches in this order:
> >>>
> >>> 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
> >>> 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
> >>> PortAssignment.java.
> >>>
> >>> PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
> >>> which is a pretty hefty patch (> 2k lines) and touches a large number of
> >>> files.
> >> Hrm, those patches were probably created against the trunk. We'll have
> >> to have separate patches for trunk and 3.2 branch on 481.
> >>
> >> If you could update the jira with this detail (481 needs two patches,
> >> one for each branch) that would be great!
> >>
> >
> > Done.
> >
> >>> 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
> >>> crashes).
> >> 473 is "special" (unique) in the sense that it changes log4j while the
> >> vm is running. In general though it's a pretty boring test and
> >> shouldn't be failing.
> >>
> >> Are you sure you have the right patch file? There are 2 patch files on
> >> the JIRA for 473; make sure that you have the one from 7/16, NOT the one
> >> from 7/15. Check the patch file: the correct one should NOT contain
> >> changes to build.xml or conf/log4j* files. If this still happens send me
> >> your build.xml, conf/log4j* and QuorumPeerMainTest.java files in email
> >> for review. I'll take a look.
> >>
> >
> >
> > I've annotated the files w/ their date while downloading:
> > 112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
> > 110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch
> >
> > It appears I applied the 7-16 patch, as that is the matching file size
> > of the patch file I applied.
> >
> > If there are to be multiple patch files for multiple branches (3.2,
> > trunk, etc.) would it make sense to label the patch files accordingly?
> >
> > Requested files in attached tar.
> >
> > -Todd
> >
> >> Patrick
> >>
> >>
> >>> [junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >>>     [junit] Running
> >>> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >>>     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
> >>>     [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >>> FAILED (crashed)
> >>>
> >>> ------------
> >>> Test Log
> >>> ------------
> >>> Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >>> Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
> >>>
> >>> Testcase: testBadPeerAddressInQuorum took 0.004 sec
> >>>     Caused an ERROR
> >>> Forked Java VM exited abnormally. Please note the time in the report
> >>> does not reflect the time until the VM exit.
> >>> junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
> >>> Please note the time in the report does not reflect the time until the
> >>> VM exit.
> >>>
> >>> -Todd
> >>>
> >>> -----Original Message-----
> >>> From: Patrick Hunt [mailto:phunt@apache.org]
> >>> Sent: Thursday, July 30, 2009 10:13 PM
> >>> To: zookeeper-user@hadoop.apache.org
> >>> Subject: Re: test failures in branch-3.2
> >>>
> >>> Todd Greenwood wrote:
> >>>> ....
> >>>> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
> >>>> assumed it was a timing issue w/ respect to test A not fully releasing
> >>>> resources before test B started.
> >>> Might be, but actually I think it's related to this:
> >>> http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
> >>>
> >>> Patrick

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Hi Todd,

Sorry for the clutter/confusion. Usually things aren't this cumbersome ;-)

In particular:
   1 committer is on vacation
   Mahadev's been out sick for multiple days
   I'm sick but trying to hang in there, but def not 100%

Hudson (CI) has been offline for effectively the past 3 weeks (that 
gates all our commits) and is just now back but flaky.

3.2 had some bugs that we are trying to address, but the aforementioned
issues are slowing us down. Otherwise we'd have all this straightened out by
now ....

At this point you should move this discussion to the dev list - Apache 
doesn't really like us to discuss code changes/futures here (user list). 
On that list you'll also see the plan for upcoming releases - I mention 
all this because we are actively working toward 3.2.1 which will include 
the JIRAs slated for that release (I'm sure you've seen).

If you can wait a bit you might be able to avoid some pain by using the 
upcoming 3.2.1 release. Once the patches land into that branch your 
issues will be resolved w/o you needing to manually apply patches, etc...


I did look at the files you attached - they look fine, so I'm not sure what
the issue is. The form of this test makes it harder - we are verifying that the
log contains sufficient information when a particular error occurs. We
fiddle with log4j in order to do this, which means that the log you are
including doesn't specify the problem.

Try instrumenting this test with a try/catch around the content of the 
test method (all the code in the failing method inside a big try/catch 
is what I mean). Then print the error to std out as part of the catch. 
That should shed some light. If you could debug it a bit that would help 
- because we aren't seeing this in our environment.

Again, sort of a moot point if you can wait a week or so...

Regards,

Patrick

Todd Greenwood wrote:
> Inline.
> 
>> -----Original Message-----
>> From: Patrick Hunt [mailto:phunt@apache.org]
>> Sent: Thursday, July 30, 2009 10:57 PM
>> To: zookeeper-user@hadoop.apache.org
>> Subject: Re: test failures in branch-3.2
>>
>> Todd Greenwood wrote:
>>> Starting w/ branch-3.2 (no changes) I applied patches in this order:
>>>
>>> 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
>>> 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
>>> PortAssignment.java.
>>>
>>> PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
>>> which is a pretty hefty patch (> 2k lines) and touches a large number of
>>> files.
>> Hrm, those patches were probably created against the trunk. We'll have
>> to have separate patches for trunk and 3.2 branch on 481.
>>
>> If you could update the jira with this detail (481 needs two patches,
>> one for each branch) that would be great!
>>
> 
> Done.
> 
>>> 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
>>> crashes).
>> 473 is "special" (unique) in the sense that it changes log4j while the
>> vm is running. In general though it's a pretty boring test and
>> shouldn't be failing.
>>
>> Are you sure you have the right patch file? There are 2 patch files on
>> the JIRA for 473; make sure that you have the one from 7/16, NOT the one
>> from 7/15. Check the patch file: the correct one should NOT contain
>> changes to build.xml or conf/log4j* files. If this still happens send me
>> your build.xml, conf/log4j* and QuorumPeerMainTest.java files in email
>> for review. I'll take a look.
>>
> 
> 
> I've annotated the files w/ their date while downloading:
> 112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
> 110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch
> 
> It appears I applied the 7-16 patch, as that is the matching file size
> of the patch file I applied.
> 
> If there are to be multiple patch files for multiple branches (3.2,
> trunk, etc.) would it make sense to label the patch files accordingly?
> 
> Requested files in attached tar.
> 
> -Todd
> 
>> Patrick
>>
>>
>>> [junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>     [junit] Running
>>> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>>     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
>>>     [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>> FAILED (crashed)
>>>
>>> ------------
>>> Test Log
>>> ------------
>>> Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>>> Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
>>>
>>> Testcase: testBadPeerAddressInQuorum took 0.004 sec
>>>     Caused an ERROR
>>> Forked Java VM exited abnormally. Please note the time in the report
>>> does not reflect the time until the VM exit.
>>> junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
>>> Please note the time in the report does not reflect the time until the
>>> VM exit.
>>>
>>> -Todd
>>>
>>> -----Original Message-----
>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>> Sent: Thursday, July 30, 2009 10:13 PM
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: test failures in branch-3.2
>>>
>>> Todd Greenwood wrote:
>>>> ....
>>>> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
>>>> assumed it was a timing issue w/ respect to test A not fully releasing
>>>> resources before test B started.
>>> Might be, but actually I think it's related to this:
>>> http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
>>>
>>> Patrick

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
Inline.

> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Thursday, July 30, 2009 10:57 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: test failures in branch-3.2
> 
> Todd Greenwood wrote:
> > Starting w/ branch-3.2 (no changes) I applied patches in this order:
> >
> > 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
> > 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
> > PortAssignment.java.
> >
> > PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
> > which is a pretty hefty patch (> 2k lines) and touches a large number of
> > files.
> 
> Hrm, those patches were probably created against the trunk. We'll have
> to have separate patches for trunk and 3.2 branch on 481.
> 
> If you could update the jira with this detail (481 needs two patches,
> one for each branch) that would be great!
> 

Done.

> > 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
> > crashes).
> 
> 473 is "special" (unique) in the sense that it changes log4j while the
> vm is running. In general though it's a pretty boring test and
> shouldn't be failing.
> 
> Are you sure you have the right patch file? There are 2 patch files on
> the JIRA for 473; make sure that you have the one from 7/16, NOT the one
> from 7/15. Check the patch file: the correct one should NOT contain
> changes to build.xml or conf/log4j* files. If this still happens send me
> your build.xml, conf/log4j* and QuorumPeerMainTest.java files in email
> for review. I'll take a look.
> 


I've annotated the files w/ their date while downloading:
112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch

It appears I applied the 7-16 patch, as that is the matching file size
of the patch file I applied.

If there are to be multiple patch files for multiple branches (3.2,
trunk, etc.) would it make sense to label the patch files accordingly?

Requested files in attached tar.

-Todd

> Patrick
> 
> 
> > [junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >     [junit] Running
> > org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> >     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
> >     [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> > FAILED (crashed)
> >
> > ------------
> > Test Log
> > ------------
> > Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> > Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
> >
> > Testcase: testBadPeerAddressInQuorum took 0.004 sec
> >     Caused an ERROR
> > Forked Java VM exited abnormally. Please note the time in the report
> > does not reflect the time until the VM exit.
> > junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
> > Please note the time in the report does not reflect the time until the
> > VM exit.
> >
> > -Todd
> >
> > -----Original Message-----
> > From: Patrick Hunt [mailto:phunt@apache.org]
> > Sent: Thursday, July 30, 2009 10:13 PM
> > To: zookeeper-user@hadoop.apache.org
> > Subject: Re: test failures in branch-3.2
> >
> > Todd Greenwood wrote:
> >> ....
> >> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
> >> assumed it was a timing issue w/ respect to test A not fully releasing
> >> resources before test B started.
> >
> > Might be, but actually I think it's related to this:
> > http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
> >
> > Patrick

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Todd Greenwood wrote:
> Starting w/ branch-3.2 (no changes) I applied patches in this order:
> 
> 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
> 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
> PortAssignment.java.
> 
> PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
> which is a pretty hefty patch (> 2k lines) and touches a large number of
> files. 

Hrm, those patches were probably created against the trunk. We'll have 
to have separate patches for trunk and 3.2 branch on 481.

If you could update the jira with this detail (481 needs two patches, 
one for each branch) that would be great!

> 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
> crashes).

473 is "special" (unique) in the sense that it changes log4j while the
vm is running. In general though it's a pretty boring test and
shouldn't be failing.

Are you sure you have the right patch file? There are 2 patch files on
the JIRA for 473; make sure that you have the one from 7/16, NOT the one
from 7/15. Check the patch file: the correct one should NOT contain
changes to build.xml or conf/log4j* files. If this still happens send me
your build.xml, conf/log4j* and QuorumPeerMainTest.java files in email
for review. I'll take a look.

Patrick


> [junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>     [junit] Running
> org.apache.zookeeper.server.quorum.QuorumPeerMainTest
>     [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
>     [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> FAILED (crashed)
> 
> ------------
> Test Log
> ------------
> Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec 
> 
> Testcase: testBadPeerAddressInQuorum took 0.004 sec 
>     Caused an ERROR
> Forked Java VM exited abnormally. Please note the time in the report
> does not reflect the time until the VM exit.
> junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
> Please note the time in the report does not reflect the time until the
> VM exit.
> 
> -Todd
> 
> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org] 
> Sent: Thursday, July 30, 2009 10:13 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: test failures in branch-3.2
> 
> Todd Greenwood wrote:
>> ....
>> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
>> assumed it was a timing issue w/ respect to test A not fully releasing
>> resources before test B started.
> 
> Might be, but actually I think it's related to this:
> http://hea-www.harvard.edu/~fine/Tech/addrinuse.html
> 
> Patrick

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick/Flavio -

Starting w/ branch-3.2 (no changes) I applied patches in this order:

1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.
2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file -
PortAssignment.java.

PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch,
which is a pretty hefty patch (> 2k lines) and touches a large number of
files. 

3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm
crashes).

[junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
    [junit] Running
org.apache.zookeeper.server.quorum.QuorumPeerMainTest
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
    [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
FAILED (crashed)

------------
Test Log
------------
Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest
Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec 

Testcase: testBadPeerAddressInQuorum took 0.004 sec 
    Caused an ERROR
Forked Java VM exited abnormally. Please note the time in the report
does not reflect the time until the VM exit.
junit.framework.AssertionFailedError: Forked Java VM exited abnormally.
Please note the time in the report does not reflect the time until the
VM exit.

-Todd

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org] 
Sent: Thursday, July 30, 2009 10:13 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

Todd Greenwood wrote:
> ....
> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
> assumed it was a timing issue w/ respect to test A not fully releasing
> resources before test B started.

Might be, but actually I think it's related to this:
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html

Patrick

Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Todd Greenwood wrote:
> ....
> [Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
> assumed it was a timing issue w/ respect to test A not fully releasing
> resources before test B started.

Might be, but actually I think it's related to this:
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html

Patrick

RE: test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick, inline.

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org] 
Sent: Thursday, July 30, 2009 9:13 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

Todd Greenwood wrote:
> The build succeeds, but not all of the tests. In previous test runs,
> I noticed an error in org.apache.zookeeper.test.FLETest. It was not able
> to bind to a port or something. Now, after a machine reboot, I'm getting
> different failures.

"address in use"? That's a problem in the test framework pre-3.3. In 3.3
(current svn trunk) I fixed it but it's not in 3.2.x. This is a problem
with the test framework though and not a real problem, it shows up
occasionally (depends on timing).

[Todd] Yes, I believe "address in use" was the problem w/ FLETest. I
assumed it was a timing issue w/ respect to test A not fully releasing
resources before test B started.

> branch-3.2 $ ant test
> 
> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> FAILED (crashed)
> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
> 
> Test logs for these two tests attached.

This is unusual though - looking at the log it seems that the JVM itself
crashed for the QPMainTest! For HQT we are seeing:

junit.framework.AssertionFailedError: Threads didn't join

which Flavio mentioned to me once is possible to happen but not a real 
problem (he can elaborate).

What version of java are you using? OS, other environment that might be 
interesting? (vm? etc...) You might try looking at the jvm crash dump 
file (I think it's in /tmp)

[Todd] ---------------------------
$ uname -a
Linux TODDG01LT 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC 2009 x86_64 GNU/Linux

$ which java
/home/toddg/bin/x64/java/jdk1.6.0_13/bin/java

$ java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

Memory = 4GB
[Todd] ---------------------------

If you run each of these two tests individually do they run? example:
ant -Dtestcase=FLENewEpochTest test-core-java

[Todd] Will try this once my local build is working and report back.
I'll open a separate mail thread on applying patches.

> My goal here is to get to a known state (all tests succeeding or have
> workarounds for the failures). Following that, I plan to apply the
> patches Flavio recommended for a WAN deploy (479 and 481). After I
> verify that the tests continue to run, I'll package this up and deploy
> it to our WAN for testing. 

Sounds like a good plan.

> So, are these known issues? Do the tests normally run en masse, or do
> some of the tests hold on to resources and prevent other tests from
> passing?

Typically they do run to completion, but occasionally on my machine 
(java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
random failure due to address in use, or the same "didn't join" that you

saw. Usually I see this if I'm multitasking (vs just letting the tests 
run w/o using the box). As I said this is addressed in 3.3 (address 
reuse at the very least, and I haven't see the other issues).

Patrick



Re: test failures in branch-3.2

Posted by Patrick Hunt <ph...@apache.org>.
Todd Greenwood wrote:
> The build succeeds, but not all of the tests. In previous test runs,
> I noticed an error in org.apache.zookeeper.test.FLETest. It was not able
> to bind to a port or something. Now, after a machine reboot, I'm getting
> different failures. 

"address in use"? That's a problem in the test framework pre-3.3. In 3.3 
(current svn trunk) I fixed it but it's not in 3.2.x. This is a problem 
with the test framework though and not a real problem, it shows up 
occasionally (depends on timing).

> branch-3.2 $ ant test
> 
> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> FAILED (crashed)
> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
> 
> Test logs for these two tests attached.

This is unusual though - looking at the log it seems that the JVM itself 
crashed for the QPMainTest! for HQT we are seeing:

junit.framework.AssertionFailedError: Threads didn't join

which Flavio mentioned to me once is possible to happen but not a real 
problem (he can elaborate).

What version of java are you using? OS, other environment that might be 
interesting? (vm? etc...) You might try looking at the jvm crash dump 
file (I think it's in /tmp)

If you run each of these two tests individually do they run? example:
ant -Dtestcase=FLENewEpochTest test-core-java

> My goal here is to get to a known state (all tests succeeding or have
> workarounds for the failures). Following that, I plan to apply the
> patches Flavio recommended for a WAN deploy (479 and 481). After I
> verify that the tests continue to run, I'll package this up and deploy
> it to our WAN for testing. 

Sounds like a good plan.

> So, are these known issues? Do the tests normally run en masse, or do
> some of the tests hold on to resources and prevent other tests from
> passing?

Typically they do run to completion, but occasionally on my machine 
(java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some 
random failure due to address in use, or the same "didn't join" that you 
saw. Usually I see this if I'm multitasking (vs just letting the tests 
run w/o using the box). As I said this is addressed in 3.3 (address 
reuse at the very least, and I haven't seen the other issues).

Patrick



Re: test failures in branch-3.2

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Todd,

On Jul 30, 2009, at 5:08 PM, Todd Greenwood wrote:

> The build succeeds, but not all of the tests do. In previous test  
> runs,
> I noticed an error in org.apache.zookeeper.test.FLETest. It was not  
> able
> to bind to a port or something. Now, after a machine reboot, I'm  
> getting
> different failures.
>

This issue might be fixed in trunk, but not in the 3.2 distribution.

> branch-3.2 $ ant test
>
> [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
> FAILED (crashed)
> [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED
>

HierarchicalQuorumTest is supposed to fail until you apply the patches  
I mentioned. I don't know what could have caused the crash of the jvm  
in the other one.

-Flavio

test failures in branch-3.2

Posted by Todd Greenwood <to...@audiencescience.com>.
The build succeeds, but not all of the tests do. In previous test runs,
I noticed an error in org.apache.zookeeper.test.FLETest. It was not able
to bind to a port or something. Now, after a machine reboot, I'm getting
different failures. 

branch-3.2 $ ant test

[junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest
FAILED (crashed)
[junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED

Test logs for these two tests attached.

My goal here is to get to a known state (all tests succeeding or have
workarounds for the failures). Following that, I plan to apply the
patches Flavio recommended for a WAN deploy (479 and 481). After I
verify that the tests continue to run, I'll package this up and deploy
it to our WAN for testing. 

So, are these known issues? Do the tests normally run en masse, or do
some of the tests hold on to resources and prevent other tests from
passing?

-Todd

RE: bad svn url : test-patch

Posted by Todd Greenwood <to...@audiencescience.com>.
Thanks Mahadev.

-----Original Message-----
From: Mahadev Konar [mailto:mahadev@yahoo-inc.com] 
Sent: Thursday, July 30, 2009 3:00 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: bad svn url : test-patch

Hi Todd,
  Yes this happens with the branch 3.2. The test-patch link is broken
because of the hadoop split. This file is used for the hudson test
environment.
It isn't used anywhere else, so the svn co otherwise should be fine. We
should fix it anyway.

Thanks
mahadev


On 7/30/09 2:57 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> FYI - looks like there is a bad url in svn...
> 
> $ svn co
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
> branch-3.2
> 
> ...
> A    branch-3.2/build.xml
> 
> Fetching external item into 'branch-3.2/src/java/test/bin'
> svn: URL
> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> doesn't exist
> 
> This does not repro w/ 3.1:
> 
> $ svn co
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.1
> branch-3.1
> 
> -Todd
> 


Re: bad svn url : test-patch

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Todd,
  Yes this happens with the branch 3.2. The test-patch link is broken
because of the hadoop split. This file is used for the hudson test environment.
It isn't used anywhere else, so the svn co otherwise should be fine. We
should fix it anyway.

Thanks
mahadev


On 7/30/09 2:57 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:

> FYI - looks like there is a bad url in svn...
> 
> $ svn co
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
> branch-3.2
> 
> ...
> A    branch-3.2/build.xml
> 
> Fetching external item into 'branch-3.2/src/java/test/bin'
> svn: URL
> 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
> doesn't exist
> 
> This does not repro w/ 3.1:
> 
> $ svn co
> http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.1
> branch-3.1
> 
> -Todd
> 


bad svn url : test-patch

Posted by Todd Greenwood <to...@audiencescience.com>.
FYI - looks like there is a bad url in svn...

$ svn co
http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2
branch-3.2

...
A    branch-3.2/build.xml

Fetching external item into 'branch-3.2/src/java/test/bin'
svn: URL
'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch'
doesn't exist

This does not repro w/ 3.1:

$ svn co
http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.1
branch-3.1

-Todd


RE: Zookeeper WAN Configuration

Posted by Todd Greenwood <to...@audiencescience.com>.
Patrick - Thank you, I'll proceed accordingly. -Todd

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org] 
Sent: Wednesday, July 29, 2009 10:30 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

> [Todd] What is the recommended policy regarding patching zookeeper
> locally? As an external user, should I patch and compile in the trunk or
> in the branch (branch-3.2)? 
> 
> I've looked at :
> http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute
> http://wiki.apache.org/hadoop/HowToRelease
> 
> And both of these seem well thought out but aimed at committers committing
> to the trunk. 
> 

In your context (want 3.2 features) you probably want to build based on 
the 3.2 tag, that way you are working off a known quantity. I'd suggest 
strongly that as part of your build you document the source base and 
which patches/changes you have applied. Having this information will be 
critical for you (or someone using your build) in case bugs have to be 
filed, or further changes/patches have to be applied, etc...

Patrick

Re: Zookeeper WAN Configuration

Posted by Patrick Hunt <ph...@apache.org>.
> [Todd] What is the recommended policy regarding patching zookeeper
> locally? As an external user, should I patch and compile in the trunk or
> in the branch (branch-3.2)? 
> 
> I've looked at :
> http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute
> http://wiki.apache.org/hadoop/HowToRelease
> 
> And both of these seem well thought out but aimed at committers committing
> to the trunk. 
> 

In your context (want 3.2 features) you probably want to build based on 
the 3.2 tag, that way you are working off a known quantity. I'd suggest 
strongly that as part of your build you document the source base and 
which patches/changes you have applied. Having this information will be 
critical for you (or someone using your build) in case bugs have to be 
filed, or further changes/patches have to be applied, etc...

Patrick

RE: Zookeeper WAN Configuration

Posted by Todd Greenwood <to...@audiencescience.com>.
Flavio -

Inline in the top snippet.

> We should have a twiki page on this. For now, you can find an example in 
> the header of QuorumHierarchical.java.

[Todd] Got it, QuorumHierarchical.java comments are very clear.

> 
> Also, I found a couple of bugs recently that may or may not affect your 
> setup, so I suggest that you apply the patches in ZOOKEEPER-481 and 
> ZOOKEEPER-479. We would like to have these patches in for the next 
> release (3.2.1), which should be out in two or three weeks, if there is 
> no further complication.
> 

[Todd] What is the recommended policy regarding patching zookeeper
locally? As an external user, should I patch and compile in the trunk or
in the branch (branch-3.2)? 

I've looked at :
http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute
http://wiki.apache.org/hadoop/HowToRelease

And both of these seem well thought out but aimed at committers committing
to the trunk. 

> Another issue that I realized that won't work in your case, but the fix 
> would be relatively easy, is the guarantee that no zero-weight follower 
> will be elected. Currently, we don't check the weight during leader 
> election. I'll open a jira and put up a patch soon.

[Todd] What source file(s) would this be in? I'll take a look at it. 

-Todd

-----Original Message-----
From: Patrick Hunt [mailto:phunt@apache.org] 
Sent: Tuesday, July 28, 2009 9:50 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Flavio, please enter a doc jira for this if there are no docs, it should 
be in forrest, not twiki btw. It would be good if you could review the 
current quorum docs (any type) and create a jira/patch that addresses 
any/all shortfall.

Patrick

Flavio Junqueira wrote:
> Todd, Some more answers. Please check out carefully the information at 
> the bottom of this message.
> 
> On Jul 27, 2009, at 4:02 PM, Todd Greenwood wrote:
> 
>>
>> I'm assuming that you're setting the weight of ZooKeeper servers in
>> PODs to zero, which means that their votes when ordering updates do
>> not count.
>>
>> [Todd] Correct.
>>
>> If my assumption is correct, then you should see a significant
>> improvement in read performance. I would say that write performance
>> wouldn't be very different from clients in PODs opening a direct
>> connection to DC.
>>
>> [Todd] So the Leader, knowing that machine(s) have a voting weight of 
>> zero, doesn't have to wait for their responses in order to form a 
>> quorum vote? Does the leader even send voting requests to the weight 
>> zero followers?
>>
> 
> In the current implementation, it does. When we have observers 
> implemented, the leader won't do it.
> 
>>
>>
>>> 3. ZK Servers within the POD would be resilient to network
>>> connectivity failure between the POD and the DC. Once connectivity
>>> re-established, the ZK Servers in the POD would sync with the ZK
>>> servers in the DC, and, from the perspective of a client within the
>>> POD, everything just worked, and there was no network failure.
>>>
>>
>> We want to have servers switching to read-only mode upon network
>> partitions, but this is a feature under development. We don't have
>> plans for implementing any model of eventual consistency that would
>> allow updates even when not being able to form a quorum, and I
>> personally believe that it would be a major change, with major
>> implications not only to the code base, but also to the semantics of
>> our API.
>>
>> [Todd] What is the current (3.2) behaviour in the case of a network 
>> failure that prevents connectivity between ZK Servers in a pod? 
>> Assuming the pod is composed of weight=0 followers...are the clients 
>> connected to these zookeeper servers still able to read? do they get 
>> exceptions on write? do the clients hang if it's a synchronous call?
> 
> The clients won't be able to read because we don't have this feature of 
> going read-only upon partitions.
> 
>>
>>
>>> 4. A WAN topology of co-located ZK servers in both the DC and (n)
>>> PODs would not significantly degrade the performance of the
>>> ensemble, provided large blobs of traffic were not being sent across
>>> the network.
>>
>> If the zk servers in the PODs are assigned weight zero, then I don't
>> see a reason for having lower performance in the scenario you
>> describe. If weights are greater than zero for zk servers in PODs,
>> then your performance might be affected, but there are ways of
>> assigning weights that do not require receiving votes from all co-
>> locations for progress.
>>
>> [Todd] Great, we'll proceed with hierarchical configuration w/ ZK 
>> Servers in pods having a voting weight of zero. Could you provide a 
>> pointer to a configuration that shows this? The docs are a bit lean in 
>> this regard...
>>
> 
> We should have a twiki page on this. For now, you can find an example in 
> the header of QuorumHierarchical.java.
> 
> Also, I found a couple of bugs recently that may or may not affect your 
> setup, so I suggest that you apply the patches in ZOOKEEPER-481 and 
> ZOOKEEPER-479. We would like to have these patches in for the next 
> release (3.2.1), which should be out in two or three weeks, if there is 
> no further complication.
> 
> Another issue that I realized that won't work in your case, but the fix 
> would be relatively easy, is the guarantee that no zero-weight follower 
> will be elected. Currently, we don't check the weight during leader 
> election. I'll open a jira and put up a patch soon.
> 
> -Flavio
> 
> 
> 

Re: Zookeeper WAN Configuration

Posted by Patrick Hunt <ph...@apache.org>.
Flavio, please enter a doc jira for this if there are no docs, it should 
be in forrest, not twiki btw. It would be good if you could review the 
current quorum docs (any type) and create a jira/patch that addresses 
any/all shortfall.

Patrick

Flavio Junqueira wrote:
> Todd, Some more answers. Please check out carefully the information at 
> the bottom of this message.
> 
> On Jul 27, 2009, at 4:02 PM, Todd Greenwood wrote:
> 
>>
>> I'm assuming that you're setting the weight of ZooKeeper servers in
>> PODs to zero, which means that their votes when ordering updates do
>> not count.
>>
>> [Todd] Correct.
>>
>> If my assumption is correct, then you should see a significant
>> improvement in read performance. I would say that write performance
>> wouldn't be very different from clients in PODs opening a direct
>> connection to DC.
>>
>> [Todd] So the Leader, knowing that machine(s) have a voting weight of 
>> zero, doesn't have to wait for their responses in order to form a 
>> quorum vote? Does the leader even send voting requests to the weight 
>> zero followers?
>>
> 
> In the current implementation, it does. When we have observers 
> implemented, the leader won't do it.
> 
>>
>>
>>> 3. ZK Servers within the POD would be resilient to network
>>> connectivity failure between the POD and the DC. Once connectivity
>>> re-established, the ZK Servers in the POD would sync with the ZK
>>> servers in the DC, and, from the perspective of a client within the
>>> POD, everything just worked, and there was no network failure.
>>>
>>
>> We want to have servers switching to read-only mode upon network
>> partitions, but this is a feature under development. We don't have
>> plans for implementing any model of eventual consistency that would
>> allow updates even when not being able to form a quorum, and I
>> personally believe that it would be a major change, with major
>> implications not only to the code base, but also to the semantics of
>> our API.
>>
>> [Todd] What is the current (3.2) behaviour in the case of a network 
>> failure that prevents connectivity between ZK Servers in a pod? 
>> Assuming the pod is composed of weight=0 followers...are the clients 
>> connected to these zookeeper servers still able to read? do they get 
>> exceptions on write? do the clients hang if it's a synchronous call?
> 
> The clients won't be able to read because we don't have this feature of 
> going read-only upon partitions.
> 
>>
>>
>>> 4. A WAN topology of co-located ZK servers in both the DC and (n)
>>> PODs would not significantly degrade the performance of the
>>> ensemble, provided large blobs of traffic were not being sent across
>>> the network.
>>
>> If the zk servers in the PODs are assigned weight zero, then I don't
>> see a reason for having lower performance in the scenario you
>> describe. If weights are greater than zero for zk servers in PODs,
>> then your performance might be affected, but there are ways of
>> assigning weights that do not require receiving votes from all co-
>> locations for progress.
>>
>> [Todd] Great, we'll proceed with hierarchical configuration w/ ZK 
>> Servers in pods having a voting weight of zero. Could you provide a 
>> pointer to a configuration that shows this? The docs are a bit lean in 
>> this regard...
>>
> 
> We should have a twiki page on this. For now, you can find an example in 
> the header of QuorumHierarchical.java.
> 
> Also, I found a couple of bugs recently that may or may not affect your 
> setup, so I suggest that you apply the patches in ZOOKEEPER-481 and 
> ZOOKEEPER-479. We would like to have these patches in for the next 
> release (3.2.1), which should be out in two or three weeks, if there is 
> no further complication.
> 
> Another issue that I realized that won't work in your case, but the fix 
> would be relatively easy, is the guarantee that no zero-weight follower 
> will be elected. Currently, we don't check the weight during leader 
> election. I'll open a jira and put up a patch soon.
> 
> -Flavio
> 
> 
> 

Re: Zookeeper WAN Configuration

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Todd, Some more answers. Please check out carefully the information at  
the bottom of this message.

On Jul 27, 2009, at 4:02 PM, Todd Greenwood wrote:

>
> I'm assuming that you're setting the weight of ZooKeeper servers in
> PODs to zero, which means that their votes when ordering updates do
> not count.
>
> [Todd] Correct.
>
> If my assumption is correct, then you should see a significant
> improvement in read performance. I would say that write performance
> wouldn't be very different from clients in PODs opening a direct
> connection to DC.
>
> [Todd] So the Leader, knowing that machine(s) have a voting weight  
> of zero, doesn't have to wait for their responses in order to form a  
> quorum vote? Does the leader even send voting requests to the weight  
> zero followers?
>

In the current implementation, it does. When we have observers  
implemented, the leader won't do it.

>
>
>> 3. ZK Servers within the POD would be resilient to network
>> connectivity failure between the POD and the DC. Once connectivity
>> re-established, the ZK Servers in the POD would sync with the ZK
>> servers in the DC, and, from the perspective of a client within the
>> POD, everything just worked, and there was no network failure.
>>
>
> We want to have servers switching to read-only mode upon network
> partitions, but this is a feature under development. We don't have
> plans for implementing any model of eventual consistency that would
> allow updates even when not being able to form a quorum, and I
> personally believe that it would be a major change, with major
> implications not only to the code base, but also to the semantics of
> our API.
>
> [Todd] What is the current (3.2) behaviour in the case of a network  
> failure that prevents connectivity between ZK Servers in a pod?  
> Assuming the pod is composed of weight=0 followers...are the clients  
> connected to these zookeeper servers still able to read? do they get  
> exceptions on write? do the clients hang if it's a synchronous call?

The clients won't be able to read because we don't have this feature  
of going read-only upon partitions.

>
>
>> 4. A WAN topology of co-located ZK servers in both the DC and (n)
>> PODs would not significantly degrade the performance of the
>> ensemble, provided large blobs of traffic were not being sent across
>> the network.
>
> If the zk servers in the PODs are assigned weight zero, then I don't
> see a reason for having lower performance in the scenario you
> describe. If weights are greater than zero for zk servers in PODs,
> then your performance might be affected, but there are ways of
> assigning weights that do not require receiving votes from all co-
> locations for progress.
>
> [Todd] Great, we'll proceed with hierarchical configuration w/ ZK  
> Servers in pods having a voting weight of zero. Could you provide a  
> pointer to a configuration that shows this? The docs are a bit lean  
> in this regard...
>

We should have a twiki page on this. For now, you can find an example  
in the header of QuorumHierarchical.java.

Also, I found a couple of bugs recently that may or may not affect  
your setup, so I suggest that you apply the patches in ZOOKEEPER-481  
and ZOOKEEPER-479. We would like to have these patches in for the next  
release (3.2.1), which should be out in two or three weeks, if there  
is no further complication.

Another issue that I realized that won't work in your case, but the  
fix would be relatively easy, is the guarantee that no zero-weight  
follower will be elected. Currently, we don't check the weight during  
leader election. I'll open a jira and put up a patch soon.

-Flavio



RE: Zookeeper WAN Configuration

Posted by Todd Greenwood <to...@audiencescience.com>.
Flavio, more questions inline:

-----Original Message-----
From: Flavio Junqueira [mailto:fpj@yahoo-inc.com] 
Sent: Sunday, July 26, 2009 12:49 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Todd, Answers inline:

On Jul 26, 2009, at 11:05 AM, Todd Greenwood wrote:

> Flavio, thank you for the suggestion.
>
> I have looked at the documentation (relevant snippets pasted in  
> below), and looked at the presentations
> (http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations),
> but I still have some questions about WAN configuration:
>
> ---------------------------------------------------------------
> WAN
> ----
> A <-> B
> A <-> C
> A <-> D
>
> A is a central processing hub (DC).
> B-D are remote colo edge nodes (PODS).
> Each POD contains (m) ZK Servers with (q) client connections.
> ---------------------------------------------------------------
>
> What are the advantages and disadvantages to co-locating ZK Servers  
> across a WAN? Could you correct my admittedly naïve assumptions here?
>
> 1. ZK Servers within a POD would significantly improve read/write  
> performance within a given POD, vs. clients within the POD opening  
> connections to the DC.
>

I'm assuming that you're setting the weight of ZooKeeper servers in  
PODs to zero, which means that their votes when ordering updates do  
not count.

[Todd] Correct.

If my assumption is correct, then you should see a significant  
improvement in read performance. I would say that write performance  
wouldn't be very different from clients in PODs opening a direct  
connection to DC.

[Todd] So the Leader, knowing that machine(s) have a voting weight of zero, doesn't have to wait for their responses in order to form a quorum vote? Does the leader even send voting requests to the weight zero followers?

> 2. ZK Servers within a POD would provide local file transacted  
> storage of writes, obviating the need to write that code ourselves.
>

Yes, local zk servers in PODs receive all updates and process them as  
any other zk server.

> 3. ZK Servers within the POD would be resilient to network  
> connectivity failure between the POD and the DC. Once connectivity  
> re-established, the ZK Servers in the POD would sync with the ZK  
> servers in the DC, and, from the perspective of a client within the  
> POD, everything just worked, and there was no network failure.
>

We want to have servers switching to read-only mode upon network  
partitions, but this is a feature under development. We don't have  
plans for implementing any model of eventual consistency that would  
allow updates even when not being able to form a quorum, and I  
personally believe that it would be a major change, with major  
implications not only to the code base, but also to the semantics of  
our API.

[Todd] What is the current (3.2) behaviour in the case of a network failure that prevents connectivity between ZK Servers in a pod? Assuming the pod is composed of weight=0 followers...are the clients connected to these zookeeper servers still able to read? do they get exceptions on write? do the clients hang if it's a synchronous call? 

> 4. A WAN topology of co-located ZK servers in both the DC and (n)  
> PODs would not significantly degrade the performance of the  
> ensemble, provided large blobs of traffic were not being sent across  
> the network.

If the zk servers in the PODs are assigned weight zero, then I don't  
see a reason for having lower performance in the scenario you  
describe. If weights are greater than zero for zk servers in PODs,  
then your performance might be affected, but there are ways of  
assigning weights that do not require receiving votes from all co- 
locations for progress.

[Todd] Great, we'll proceed with hierarchical configuration w/ ZK Servers in pods having a voting weight of zero. Could you provide a pointer to a configuration that shows this? The docs are a bit lean in this regard...



-Flavio

Re: Zookeeper WAN Configuration

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Todd, Answers inline:

On Jul 26, 2009, at 11:05 AM, Todd Greenwood wrote:

> Flavio, thank you for the suggestion.
>
> I have looked at the documentation (relevant snippets pasted in  
> below), and looked at the presentations
> (http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations),
> but I still have some questions about WAN configuration:
>
> ---------------------------------------------------------------
> WAN
> ----
> A <-> B
> A <-> C
> A <-> D
>
> A is a central processing hub (DC).
> B-D are remote colo edge nodes (PODS).
> Each POD contains (m) ZK Servers with (q) client connections.
> ---------------------------------------------------------------
>
> What are the advantages and disadvantages to co-locating ZK Servers  
> across a WAN? Could you correct my admittedly naïve assumptions here?
>
> 1. ZK Servers within a POD would significantly improve read/write  
> performance within a given POD, vs. clients within the POD opening  
> connections to the DC.
>

I'm assuming that you're setting the weight of ZooKeeper servers in  
PODs to zero, which means that their votes when ordering updates do  
not count.

If my assumption is correct, then you should see a significant  
improvement in read performance. I would say that write performance  
wouldn't be very different from clients in PODs opening a direct  
connection to DC.

> 2. ZK Servers within a POD would provide local file transacted  
> storage of writes, obviating the need to write that code ourselves.
>

Yes, local zk servers in PODs receive all updates and process them as  
any other zk server.

> 3. ZK Servers within the POD would be resilient to network  
> connectivity failure between the POD and the DC. Once connectivity  
> re-established, the ZK Servers in the POD would sync with the ZK  
> servers in the DC, and, from the perspective of a client within the  
> POD, everything just worked, and there was no network failure.
>

We want to have servers switching to read-only mode upon network  
partitions, but this is a feature under development. We don't have  
plans for implementing any model of eventual consistency that would  
allow updates even when not being able to form a quorum, and I  
personally believe that it would be a major change, with major  
implications not only to the code base, but also to the semantics of  
our API.

> 4. A WAN topology of co-located ZK servers in both the DC and (n)  
> PODs would not significantly degrade the performance of the  
> ensemble, provided large blobs of traffic were not being sent across  
> the network.

If the zk servers in the PODs are assigned weight zero, then I don't  
see a reason for having lower performance in the scenario you  
describe. If weights are greater than zero for zk servers in PODs,  
then your performance might be affected, but there are ways of  
assigning weights that do not require receiving votes from all co- 
locations for progress.

-Flavio

Re: Zookeeper WAN Configuration

Posted by Ted Dunning <te...@gmail.com>.
This is the problem.

ALL writes go from the leader to all nodes and the transaction isn't done
until a quorum of machines have confirmed the write.  Unless you have a
quorum in the central facility, then all writes will be as slow as several
round-trips to the peripheral installations.  This slows down every
transaction.

Observers might help because they are not considered to be part of the
quorum.
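
To make the point above concrete, here is a rough sketch of how the placement of voting servers sets write latency. It is a simplified model, not ZooKeeper code: it assumes a plain majority quorum, counts the leader's own acknowledgement, charges a single round trip per follower, and uses made-up round-trip times (1 ms inside the DC, 80 ms to a pod). The function name commit_latency_ms is hypothetical.

```python
def commit_latency_ms(follower_rtts_ms, ensemble_size):
    """Estimate write latency under a simple majority quorum.

    The leader counts as one acknowledgement, so it waits for
    (majority - 1) follower acks; the slowest of those sets the latency.
    """
    majority = ensemble_size // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0
    if acks_needed > len(follower_rtts_ms):
        raise ValueError("not enough followers to form a quorum")
    # The quorum completes when the acks_needed-th fastest follower answers.
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Leader plus one follower in the DC (1 ms), three followers in pods (80 ms).
mostly_remote = commit_latency_ms([1, 80, 80, 80], ensemble_size=5)

# Leader plus two followers in the DC (1 ms), two followers in pods (80 ms).
quorum_in_dc = commit_latency_ms([1, 1, 80, 80], ensemble_size=5)

print(mostly_remote, quorum_in_dc)
```

In this model the write only stays fast when enough voting servers to complete a quorum sit next to the leader; real latency would also include the extra protocol round trips mentioned above.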

On Sun, Jul 26, 2009 at 11:05 AM, Todd Greenwood
<to...@audiencescience.com> wrote:

>
> 4. A WAN topology of co-located ZK servers in both the DC and (n) PODs
> would not significantly degrade the performance of the ensemble, provided
> large blobs of traffic were not being sent across the network.




-- 
Ted Dunning, CTO
DeepDyve

RE: Zookeeper WAN Configuration

Posted by Todd Greenwood <to...@audiencescience.com>.
Flavio, thank you for the suggestion.

I have looked at the documentation (relevant snippets pasted in below), and looked at the presentations (http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations),
but I still have some questions about WAN configuration:

---------------------------------------------------------------
WAN
----
A <-> B
A <-> C
A <-> D

A is a central processing hub (DC).
B-D are remote colo edge nodes (PODS).
Each POD contains (m) ZK Servers with (q) client connections.
---------------------------------------------------------------
  
What are the advantages and disadvantages to co-locating ZK Servers across a WAN? Could you correct my admittedly naïve assumptions here?

1. ZK Servers within a POD would significantly improve read/write performance within a given POD, vs. clients within the POD opening connections to the DC.

2. ZK Servers within a POD would provide local file transacted storage of writes, obviating the need to write that code ourselves.

3. ZK Servers within the POD would be resilient to network connectivity failure between the POD and the DC. Once connectivity re-established, the ZK Servers in the POD would sync with the ZK servers in the DC, and, from the perspective of a client within the POD, everything just worked, and there was no network failure.

4. A WAN topology of co-located ZK servers in both the DC and (n) PODs would not significantly degrade the performance of the ensemble, provided large blobs of traffic were not being sent across the network.

--------------------
Doc references below
--------------------

http://hadoop.apache.org/zookeeper/docs/r3.2.0/zookeeperAdmin.html

"""
group.x=nnnnn[:nnnnn]

    (No Java system property)

    Enables a hierarchical quorum construction. "x" is a group identifier and the numbers following the "=" sign correspond to server identifiers. The right-hand side of the assignment is a colon-separated list of server identifiers. Note that groups must be disjoint and the union of all groups must be the ZooKeeper ensemble.

weight.x=nnnnn

    (No Java system property)

    Used along with "group", it assigns a weight to a server when forming quorums. Such a value corresponds to the weight of a server when voting. There are a few parts of ZooKeeper that require voting such as leader election and the atomic broadcast protocol. By default the weight of server is 1. If the configuration defines groups, but not weights, then a value of 1 will be assigned to all servers.
"""
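
To make the two options above concrete, here is a minimal Python sketch (not ZooKeeper code) that reads group.x and weight.x lines and checks the "disjoint groups, union is the ensemble" rule. The function names and the example values (9 servers, 3 groups, server 9 with weight 0) are assumptions made up for illustration.

```python
def parse_quorum_config(text):
    """Parse group.x and weight.x lines into two dictionaries."""
    groups = {}   # group id -> set of server ids
    weights = {}  # server id -> weight
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("group."):
            groups[int(key[len("group."):])] = {int(s) for s in value.split(":")}
        elif key.startswith("weight."):
            weights[int(key[len("weight."):])] = int(value)
    # Servers without an explicit weight default to 1, as the quoted text states.
    for members in groups.values():
        for sid in members:
            weights.setdefault(sid, 1)
    return groups, weights


def validate_groups(groups, ensemble):
    """Groups must be disjoint and their union must be the whole ensemble."""
    seen = set()
    for members in groups.values():
        if seen & members:
            return False  # a server appears in two groups
        seen |= members
    return seen == set(ensemble)


# Hypothetical configuration fragment, for illustration only.
EXAMPLE = """
group.1=1:2:3
group.2=4:5:6
group.3=7:8:9
weight.9=0
"""

groups, weights = parse_quorum_config(EXAMPLE)
groups_ok = validate_groups(groups, range(1, 10))
```
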

http://hadoop.apache.org/zookeeper/docs/r3.2.0/zookeeperInternals.html

"""
A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G, the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each server, then we are able to form quorums of size 4. Note that two subsets of processes composed each of a majority of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect that a majority of co-locations will have a majority of servers available with high probability.

With ZooKeeper, we provide a user with the ability of configuring servers to use majority quorums, weights, or a hierarchy of groups.
"""
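
The quorum rule in the snippet above can be sketched in a few lines of Python. This is only my reading of the quoted paragraph, not the actual ZooKeeper implementation; the function name and the 9-server layout are assumptions taken from the example in the text.

```python
def is_hierarchical_quorum(acks, groups, weights):
    """Return True if the servers in `acks` form a hierarchical quorum.

    A group is "won" when the acknowledged weight in it is larger than
    half of the group's total weight; a quorum needs a majority of groups.
    """
    groups_won = 0
    for members in groups.values():
        total = sum(weights.get(s, 1) for s in members)
        got = sum(weights.get(s, 1) for s in members if s in acks)
        if 2 * got > total:
            groups_won += 1
    return 2 * groups_won > len(groups)


# 9 servers split into 3 groups, every server with weight 1.
GROUPS = {1: {1, 2, 3}, 2: {4, 5, 6}, 3: {7, 8, 9}}
WEIGHTS = {sid: 1 for sid in range(1, 10)}

# Two servers from each of two groups: a quorum of size 4.
small_quorum = is_hierarchical_quorum({1, 2, 4, 5}, GROUPS, WEIGHTS)
# Five servers, but a majority in only one group: not a quorum.
lopsided = is_hierarchical_quorum({1, 2, 3, 4, 7}, GROUPS, WEIGHTS)
```

Note that the second set has more servers than the first and still does not qualify, which is why the placement of servers across groups matters more than the raw count.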

-----Original Message-----
From: Flavio Junqueira [mailto:fpj@yahoo-inc.com] 
Sent: Saturday, July 25, 2009 7:55 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Todd, you can try using flexible quorums to implement what you're
requesting. You can simulate the observer behavior I described by
setting the weight of the server to zero. Please check the
documentation at:

	http://hadoop.apache.org/zookeeper/docs/r3.2.0/zookeeperAdmin.html

Look under "Cluster Options" for options such as group and weight.

-Flavio


On Jul 24, 2009, at 5:03 PM, Todd Greenwood wrote:

>
> In the future, once the Observers feature is implemented, then we  
> should
> be able to deploy zk servers to both the DC and to the pods...with all
> the goodness that Flavio mentions below.
>
>
> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Friday, July 24, 2009 4:50 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Zookeeper WAN Configuration
>
> Just a few quick observations:
>
> On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:
>
>> On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
>> <to...@audiencescience.com>wrote:
>>
>>> Could you explain the idea behind the Observers feature, what this
>>> concept is supposed to address, and how it applies to the WAN
>>> configuration problem in particular?
>>>
>>
>> Not really.  I am just echoing comments on observers from them that
>> know.
>>
>
> Without observers, increasing the number of servers in an ensemble
> enables higher read throughput, but causes write throughput to drop
> because the number of votes to order each write operation increases.
> Essentially, observers are zookeeper servers that don't vote when
> ordering updates to the zookeeper state. Adding observers enables
> higher read throughput affecting minimally write throughput (leader
> still has to send commits to everyone, at least in the version we have
> been working on).
>
>>
>>> """
>>> The ideas for federating ZK or allowing observers would likely do
>>> what
>>> you
>>> want.  I can imagine that an observer would only care that it can  
>>> see
>>> it's
>>> local peers and one of the observers would be elected to get updates
>>> (and
>>> thus would care about the central service).
>>> """
>>> This certainly sounds like exactly what I want...Was this
>>> introduced in
>>> 3.2 in full, or only partially?
>>>
>>
>> I don't think it is even in trunk yet.  Look on Jira or at the
>> recent logs
>> of this mailing list.
>
> It is not on trunk yet.
>
> -Flavio
>


Re: Zookeeper WAN Configuration

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Todd, you can try using flexible quorums to implement what you're
requesting. You can simulate the observer behavior I described by
setting the weight of the server to zero. Please check the
documentation at:

	http://hadoop.apache.org/zookeeper/docs/r3.2.0/zookeeperAdmin.html

Look under "Cluster Options" for options such as group and weight.
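
As a rough illustration of the zero-weight idea, here is a small Python sketch, assuming a plain weighted majority over a single set of servers. The server ids and the DC/POD split are hypothetical, and this is not how ZooKeeper itself is coded.

```python
def has_weighted_majority(acks, weights):
    """True if the acknowledged weight is more than half of the total weight."""
    total = sum(weights.values())
    got = sum(w for sid, w in weights.items() if sid in acks)
    return 2 * got > total


# Hypothetical layout: servers 1-3 in the DC vote, servers 4-6 in the
# PODs have weight 0 and so never count towards a quorum.
WEIGHTS = {1: 1, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0}

dc_pair = has_weighted_majority({1, 2}, WEIGHTS)          # two DC servers
pods_plus_one = has_weighted_majority({3, 4, 5, 6}, WEIGHTS)  # one DC server plus all PODs
```

With these weights, two DC servers are enough to order a write, while any number of zero-weight servers cannot form a quorum without them.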

-Flavio


On Jul 24, 2009, at 5:03 PM, Todd Greenwood wrote:

>
> In the future, once the Observers feature is implemented, then we  
> should
> be able to deploy zk servers to both the DC and to the pods...with all
> the goodness that Flavio mentions below.
>
>
> -----Original Message-----
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Friday, July 24, 2009 4:50 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Zookeeper WAN Configuration
>
> Just a few quick observations:
>
> On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:
>
>> On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
>> <to...@audiencescience.com>wrote:
>>
>>> Could you explain the idea behind the Observers feature, what this
>>> concept is supposed to address, and how it applies to the WAN
>>> configuration problem in particular?
>>>
>>
>> Not really.  I am just echoing comments on observers from them that
>> know.
>>
>
> Without observers, increasing the number of servers in an ensemble
> enables higher read throughput, but causes write throughput to drop
> because the number of votes to order each write operation increases.
> Essentially, observers are zookeeper servers that don't vote when
> ordering updates to the zookeeper state. Adding observers enables
> higher read throughput affecting minimally write throughput (leader
> still has to send commits to everyone, at least in the version we have
> been working on).
>
>>
>>> """
>>> The ideas for federating ZK or allowing observers would likely do
>>> what
>>> you
>>> want.  I can imagine that an observer would only care that it can  
>>> see
>>> it's
>>> local peers and one of the observers would be elected to get updates
>>> (and
>>> thus would care about the central service).
>>> """
>>> This certainly sounds like exactly what I want...Was this
>>> introduced in
>>> 3.2 in full, or only partially?
>>>
>>
>> I don't think it is even in trunk yet.  Look on Jira or at the
>> recent logs
>> of this mailing list.
>
> It is not on trunk yet.
>
> -Flavio
>


RE: Zookeeper WAN Configuration

Posted by Todd Greenwood <to...@audiencescience.com>.
Flavio & Ted, thank you for your comments.

So it sounds like the only way to currently deploy to the WAN is to
deploy ZK Servers to the central DC and open up client connections to
these ZK servers from the edge nodes. True?

In the future, once the Observers feature is implemented, then we should
be able to deploy zk servers to both the DC and to the pods...with all
the goodness that Flavio mentions below.

Flavio - do you have a doc that describes exactly what happens in the
transaction of a write operation? For instance, I'd like to know at
exactly what stage a write has been committed to the ensemble, and not
just the zk server the client is connected to. I figure it must be
something like:

clientA.write(path, value)
-> serverA writes to memory
-> serverA writes to transacted disk every n/seconds or m/bytes
-> serverA sends write to Leader
-> Leader stamps with transaction id
-> Leader responds to ensemble with update + transaction id

-Todd

-----Original Message-----
From: Flavio Junqueira [mailto:fpj@yahoo-inc.com] 
Sent: Friday, July 24, 2009 4:50 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Just a few quick observations:

On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:

> On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
> <to...@audiencescience.com>wrote:
>
>> Could you explain the idea behind the Observers feature, what this
>> concept is supposed to address, and how it applies to the WAN
>> configuration problem in particular?
>>
>
> Not really.  I am just echoing comments on observers from them that  
> know.
>

Without observers, increasing the number of servers in an ensemble  
enables higher read throughput, but causes write throughput to drop  
because the number of votes to order each write operation increases.  
Essentially, observers are zookeeper servers that don't vote when  
ordering updates to the zookeeper state. Adding observers enables
higher read throughput with minimal effect on write throughput (the leader
still has to send commits to everyone, at least in the version we have
been working on).

>
>> """
>> The ideas for federating ZK or allowing observers would likely do  
>> what
>> you
>> want.  I can imagine that an observer would only care that it can see
>> it's
>> local peers and one of the observers would be elected to get updates
>> (and
>> thus would care about the central service).
>> """
>> This certainly sounds like exactly what I want...Was this  
>> introduced in
>> 3.2 in full, or only partially?
>>
>
> I don't think it is even in trunk yet.  Look on Jira or at the  
> recent logs
> of this mailing list.

It is not on trunk yet.

-Flavio


Re: Zookeeper WAN Configuration

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Just a few quick observations:

On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:

> On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
> <to...@audiencescience.com>wrote:
>
>> Could you explain the idea behind the Observers feature, what this
>> concept is supposed to address, and how it applies to the WAN
>> configuration problem in particular?
>>
>
> Not really.  I am just echoing comments on observers from them that  
> know.
>

Without observers, increasing the number of servers in an ensemble  
enables higher read throughput, but causes write throughput to drop  
because the number of votes to order each write operation increases.  
Essentially, observers are zookeeper servers that don't vote when  
ordering updates to the zookeeper state. Adding observers enables
higher read throughput with minimal effect on write throughput (the leader
still has to send commits to everyone, at least in the version we have
been working on).

>
>> """
>> The ideas for federating ZK or allowing observers would likely do  
>> what
>> you
>> want.  I can imagine that an observer would only care that it can see
>> it's
>> local peers and one of the observers would be elected to get updates
>> (and
>> thus would care about the central service).
>> """
>> This certainly sounds like exactly what I want...Was this  
>> introduced in
>> 3.2 in full, or only partially?
>>
>
> I don't think it is even in trunk yet.  Look on Jira or at the  
> recent logs
> of this mailing list.

It is not on trunk yet.

-Flavio


Re: Zookeeper WAN Configuration

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
<to...@audiencescience.com>wrote:

> Could you explain the idea behind the Observers feature, what this
> concept is supposed to address, and how it applies to the WAN
> configuration problem in particular?
>

Not really.  I am just echoing comments on observers from them that know.


>  """
> The ideas for federating ZK or allowing observers would likely do what
> you
> want.  I can imagine that an observer would only care that it can see
> it's
> local peers and one of the observers would be elected to get updates
> (and
> thus would care about the central service).
> """
> This certainly sounds like exactly what I want...Was this introduced in
> 3.2 in full, or only partially?
>

I don't think it is even in trunk yet.  Look on Jira or at the recent logs
of this mailing list.


>
> Here, do you mean the servers will log warnings untill all the ensemble
> members are visible to each other?
>

Again, I can only speculate.  I would guess if a quorum of sibling observers
is not available, ZK servers freeze all changes but continue to serve
read-only.


> Given that 3.2 has a serious bug, I've recommended that we proceed with
> our deploy based 3.1.1. For this version, it sounds like we will have to
> open up connectivity from each zk server to each zk server, across the
> various zones in the WAN. Is this correct?
>

I don't think so.  I think you would be ahead if you put all ZK machines in
the central zone and only have WAN connections between clients and servers,
not from server to server.  Otherwise, all clients pay for the geographical
sins of a few.


-- 
Ted Dunning, CTO
DeepDyve

Re: Zookeeper WAN Configuration

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Servers in a quorum need to be able to talk to each other to elect a  
leader. Once a leader is elected, followers only talk to the leader.  
Of course, if the leader fails, servers in some quorum will need to  
talk to each other again. If no quorum can be formed, the system is  
stalled.

-Flavio

On Jul 24, 2009, at 4:37 PM, Ted Dunning wrote:

> Each member needs a connection to a quorum.  The quorum is ceiling((N+1) / 2)
> members of the cluster.
>
> This guarantees that network partition does not allow two leaders to  
> go on
> stamping out revisions independent of each other.
>
> On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
> <to...@audiencescience.com>wrote:
>
>> Ted, could you elaborate a bit more on this? I was under the (mis)
>> impression that each ZK server in an ensemble only needed  
>> connectivity
>> to another member in the ensemble, not to each member in the  
>> ensemble.
>> It sounds like you are saying the latter is true.
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve


Re: Zookeeper WAN Configuration

Posted by Ted Dunning <te...@gmail.com>.
Each member needs a connection to a quorum.  The quorum is ceiling((N+1) /
2) members of the cluster.

This guarantees that network partition does not allow two leaders to go on
stamping out revisions independent of each other.
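
A quick Python sketch of the arithmetic, with a brute-force check that any two quorums of that size share at least one member (which is what rules out two independent leaders). The function names are made up for illustration.

```python
import math
from itertools import combinations


def quorum_size(n):
    """Majority quorum size for an ensemble of n servers."""
    return math.ceil((n + 1) / 2)


def quorums_always_intersect(n):
    """Brute-force check that every pair of quorums shares a server."""
    q = quorum_size(n)
    servers = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(servers, q)
        for b in combinations(servers, q)
    )


sizes = {n: quorum_size(n) for n in (3, 4, 5)}
overlap_for_five = quorums_always_intersect(5)
```

Note that going from 3 to 4 servers raises the quorum size from 2 to 3 without tolerating any more failures.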

On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood
<to...@audiencescience.com>wrote:

> Ted, could you elaborate a bit more on this? I was under the (mis)
> impression that each ZK server in an ensemble only needed connectivity
> to another member in the ensemble, not to each member in the ensemble.
> It sounds like you are saying the latter is true.
>



-- 
Ted Dunning, CTO
DeepDyve

RE: Zookeeper WAN Configuration

Posted by Todd Greenwood <to...@audiencescience.com>.
Ted, could you elaborate a bit more on this? I was under the (mis)
impression that each ZK server in an ensemble only needed connectivity
to another member in the ensemble, not to each member in the ensemble.
It sounds like you are saying the latter is true.

Could you explain the idea behind the Observers feature, what this
concept is supposed to address, and how it applies to the WAN
configuration problem 
in particular?
"""
The ideas for federating ZK or allowing observers would likely do what you
want.  I can imagine that an observer would only care that it can see its
local peers and one of the observers would be elected to get updates (and
thus would care about the central service).
"""
This certainly sounds like exactly what I want...Was this introduced in
3.2 in full, or only partially?


Here, do you mean the servers will log warnings until all the ensemble
members are visible to each other?
"""
Any servers that see a minority of the other servers will go tharn until the
"partition" is healed.  That isn't what you want (at all).
"""

Given that 3.2 has a serious bug, I've recommended that we proceed with
our deployment based on 3.1.1. For this version, it sounds like we will have to
open up connectivity from each zk server to each zk server, across the
various zones in the WAN. Is this correct?

-Todd

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Friday, July 24, 2009 3:41 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Vanilla ZK servers will see security constraints as a network partition.
Any servers that see a minority of the other servers will go tharn until the
"partition" is healed.  That isn't what you want (at all).

The ideas for federating ZK or allowing observers would likely do what you
want.  I can imagine that an observer would only care that it can see its
local peers and one of the observers would be elected to get updates (and
thus would care about the central service).

On Fri, Jul 24, 2009 at 3:32 PM, Todd Greenwood
<to...@audiencescience.com>wrote:

> Like most folks, our WAN is composed of various zones, some central
> processing, some edge, some corp, and some in between (DMZs). In this
> model, a given Zookeeper server will not have direct connectivity to all
> of its peers in the ensemble due to various security constraints. Is
> this a problem? Are there special configurations for this model?
>
> Given 3 Zones
> -------------
>
> A <--> B
>         B <--> C
>
> A cannot see C, and vice versa.
> B can see A and C.
>
> 1. Will zookeeper servers function properly even if a given set of
> servers can only see some of the servers in the ensemble? For example,
> the shared config lists all zk servers in A, B, and C, but A can only
> see B, C can only see B, and B can see both A and C.
>
> 2. Will zookeeper servers flood the log with error messages if only a
> subset of the ensemble members are visible?
>
> 3. Will the zk ensemble function properly if the config used by each
> server only lists the servers in the ensemble that are visible? Suppose
> that A has a config that only lists servers in A and B, C a config for C
> and B, and B has a config that lists servers in A, B, and C. Is this the
> recommended approach?
>
> http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Zookeeper WAN Configuration

Posted by Ted Dunning <te...@gmail.com>.
Vanilla ZK servers will see security constraints as a network partition.
Any servers that see a minority of the other servers will go tharn until the
"partition" is healed.  That isn't what you want (at all).

The ideas for federating ZK or allowing observers would likely do what you
want.  I can imagine that an observer would only care that it can see its
local peers and one of the observers would be elected to get updates (and
thus would care about the central service).
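
To illustrate the partition point with the A/B/C zones from the question, here is a rough Python sketch that counts how many ensemble members each server can reach directly. The server names and the number of servers per zone are hypothetical, and the sketch only looks at direct visibility; it says nothing about leader election itself.

```python
# Zone reachability from the question: A <--> B, B <--> C, A cannot see C.
ZONE_LINKS = {"A": {"A", "B"}, "B": {"A", "B", "C"}, "C": {"B", "C"}}


def sees_majority(zone_of, links):
    """Map each server to True if it can see more than half of the ensemble.

    A server counts itself and every server in a zone its own zone can reach.
    """
    n = len(zone_of)
    result = {}
    for server, zone in zone_of.items():
        visible = sum(1 for other in zone_of.values() if other in links[zone])
        result[server] = 2 * visible > n
    return result


# Hypothetical placement: one server in A, one in B, three in C.
ZONE_OF = {"a1": "A", "b1": "B", "c1": "C", "c2": "C", "c3": "C"}
view = sees_majority(ZONE_OF, ZONE_LINKS)
```

With this placement the server in zone A sees only 2 of 5 members, so it is the one that would behave as if it were partitioned away.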

On Fri, Jul 24, 2009 at 3:32 PM, Todd Greenwood
<to...@audiencescience.com>wrote:

> Like most folks, our WAN is composed of various zones, some central
> processing, some edge, some corp, and some in between (DMZs). In this
> model, a given Zookeeper server will not have direct connectivity to all
> of it's peers in the ensemble due to various security constraints. Is
> this a problem? Are there special configurations for this model?
>
> Given 3 Zones
> -------------
>
> A <--> B
>         B <--> C
>
> A cannot see C, and vice versa.
> B can see A and C.
>
> 1. Will zookeeper servers function properly even if a given set of
> servers can only see some of the servers in the ensemble? For example,
> the shared config lists all zk servers in A, B, and C, but A can only
> see B, C can only see B, and B can see both A and C.
>
> 2. Will zookeeper servers flood the log with error messages if only a
> subset of the ensemble members are visible?
>
> 3. Will the zk ensemble function properly if the config used by each
> server only lists the servers in the ensemble that are visible? Suppose
> that A has a config that only list servers in A and B, C a config for C
> and B, and B has a config that lists servers in A, B, and C. Is this the
> recommended approach?
>
> http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html
>



-- 
Ted Dunning, CTO
DeepDyve

Zookeeper WAN Configuration

Posted by Todd Greenwood <to...@audiencescience.com>.
Like most folks, our WAN is composed of various zones, some central
processing, some edge, some corp, and some in between (DMZs). In this
model, a given Zookeeper server will not have direct connectivity to all
of its peers in the ensemble due to various security constraints. Is
this a problem? Are there special configurations for this model?

Given 3 Zones
-------------

A <--> B
	 B <--> C

A cannot see C, and vice versa.
B can see A and C.

1. Will zookeeper servers function properly even if a given set of
servers can only see some of the servers in the ensemble? For example,
the shared config lists all zk servers in A, B, and C, but A can only
see B, C can only see B, and B can see both A and C.

2. Will zookeeper servers flood the log with error messages if only a
subset of the ensemble members are visible?

3. Will the zk ensemble function properly if the config used by each
server only lists the servers in the ensemble that are visible? Suppose
that A has a config that only lists servers in A and B, C a config for C
and B, and B has a config that lists servers in A, B, and C. Is this the
recommended approach?

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html