You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by Marshall McMullen <ma...@gmail.com> on 2011/07/13 00:41:21 UTC

Multiop submitted to non-leader fails

I'm trying to debug a really obscure problem we've been seeing integrating
multiop into our code base.

After a lot of debugging, I've isolated the error scenario down to when the
multi op gets submitted through any non-leader. The symptoms I see when I
submit the multi-op to a non-leader is that the multi-op, no matter how
small (in this case it only has one op in it to create a single zknode)
always times out. When submitting it to non-leaders, I am seeing a couple of
key messages in the log files:

2011-07-12 13:20:59,168 - WARN  [LearnerHandler-/127.0.0.7:53952:Leader@507][]
- *Commiting zxid 0x100000000 from /127.0.0.7:2182 not first!*


2011-07-12 13:20:59,169 - WARN  [LearnerHandler-/127.0.0.7:53952:Leader@509][]
- *First is 0           *


2011-07-12 13:20:59,169 - INFO  [LearnerHandler-/127.0.0.7:53952:Leader@533][]
- Have quorum of supporters; starting up and setting last processed zxid:
4294967296

2011-07-12 13:20:59,379 - INFO  [NIOServerCxn.Factory:/127.0.0.5:2181
:NIOServerCnxnFactory@197][] - Accepted socket connection from /
127.0.0.5:53553

2011-07-12 13:20:59,381 - WARN  [NIOServerCxn.Factory:/127.0.0.5:2181
:ZooKeeperServer@807][] - Connection request from old client /
127.0.0.5:53553; will be dropped if server is in r-o mode

2011-07-12 13:20:59,381 - INFO  [NIOServerCxn.Factory:/127.0.0.5:2181
:ZooKeeperServer@853][] - Client attempting to establish new session at /
127.0.0.5:53553

2011-07-12 13:20:59,385 - INFO  [SyncThread:3:FileTxnLog@195][] - Creating
new log file: log.100000001


2011-07-12 13:20:59,387 - WARN  [QuorumPeer[myid=1]/127.0.0.5:2181
:Follower@118][] - Got zxid 0x100000001 expected 0x1


2011-07-12 13:20:59,387 - INFO  [SyncThread:1:FileTxnLog@195][] - Creating
new log file: log.100000001


2011-07-12 13:20:59,387 - WARN  [QuorumPeer[myid=2]/127.0.0.6:2181
:Follower@118][] - Got zxid 0x100000001 expected 0x1


If I instead submit the exact same multi-op to the leader instead of a
follower, it succeeds every time.

I'm suspecting there is a bug in how a Multi-op is forwarded between a
follower and a leader (or vice versa). I'm going to be digging into this
tonight, but if anyone has any pointers on where to look, I'd be super
appreciative.

Thanks,
Marshall

Re: Multiop submitted to non-leader fails

Posted by Marshall McMullen <ma...@gmail.com>.
I actually think I've figured it out. It was a trivial (but really vital)
missing case statement. I'll send you more details later tonight...

On Tue, Jul 12, 2011 at 4:41 PM, Marshall McMullen <
marshall.mcmullen@gmail.com> wrote:

> I'm trying to debug a really obscure problem we've been seeing integrating
> multiop into our code base.
>
> After a lot of debugging, I've isolated the error scenario down to when the
> multi op gets submitted through any non-leader. The symptoms I see when I
> submit the multi-op to a non-leader is that the multi-op, no matter how
> small (in this case it only has one op in it to create a single zknode)
> always times out. When submitting it to non-leaders, I am seeing a couple
> of key messages in the log files:
>
> 2011-07-12 13:20:59,168 - WARN  [LearnerHandler-/127.0.0.7:53952
> :Leader@507][] - *Commiting zxid 0x100000000 from /127.0.0.7:2182 not
> first!*
>
> 2011-07-12 13:20:59,169 - WARN  [LearnerHandler-/127.0.0.7:53952
> :Leader@509][] - *First is 0           *
>
>
> 2011-07-12 13:20:59,169 - INFO  [LearnerHandler-/127.0.0.7:53952
> :Leader@533][] - Have quorum of supporters; starting up and setting last
> processed zxid: 4294967296
>
> 2011-07-12 13:20:59,379 - INFO  [NIOServerCxn.Factory:/127.0.0.5:2181
> :NIOServerCnxnFactory@197][] - Accepted socket connection from /
> 127.0.0.5:53553
>
> 2011-07-12 13:20:59,381 - WARN  [NIOServerCxn.Factory:/127.0.0.5:2181
> :ZooKeeperServer@807][] - Connection request from old client /
> 127.0.0.5:53553; will be dropped if server is in r-o mode
>
> 2011-07-12 13:20:59,381 - INFO  [NIOServerCxn.Factory:/127.0.0.5:2181
> :ZooKeeperServer@853][] - Client attempting to establish new session at /
> 127.0.0.5:53553
>
> 2011-07-12 13:20:59,385 - INFO  [SyncThread:3:FileTxnLog@195][] - Creating
> new log file: log.100000001
>
>
> 2011-07-12 13:20:59,387 - WARN  [QuorumPeer[myid=1]/127.0.0.5:2181
> :Follower@118][] - Got zxid 0x100000001 expected 0x1
>
>
> 2011-07-12 13:20:59,387 - INFO  [SyncThread:1:FileTxnLog@195][] - Creating
> new log file: log.100000001
>
>
> 2011-07-12 13:20:59,387 - WARN  [QuorumPeer[myid=2]/127.0.0.6:2181
> :Follower@118][] - Got zxid 0x100000001 expected 0x1
>
>
> If I instead submit the exact same multi-op to the leader instead of a
> follower, it succeeds every time.
>
> I'm suspecting there is a bug in how a Multi-op is forwarded between a
> follower and a leader (or vice versa). I'm going to be digging into this
> tonight, but if anyone has any pointers on where to look, I'd be super
> appreciative.
>
> Thanks,
> Marshall
>
>
>