Posted to user@zookeeper.apache.org by Ted Dunning <te...@gmail.com> on 2010/04/20 23:14:58 UTC

odd error message

We have just done an upgrade of ZK to 3.3.0.  Previous to this, ZK has been
up for about a year with no problems.

On two nodes, we killed the previous instance and started the 3.3.0
instance.  The first node was a follower and the second a leader.

All went according to plan and no clients seemed to notice anything.  The
stat command showed connections moving around as expected and all other
indicators were normal.

When we did the third node, we saw this in the log:

2010-04-20 14:07:49,010 - FATAL [QuorumPeer:/0.0.0.0:2181:Follower@71] -
Leader epoch 18 is less than our epoch 19

The third node refused all connections.

We brought down the third node, wiped away its snapshot, restarted and it
joined without complaint.  Note that the third node
was originally a follower and had never been a leader during the upgrade
process.

Does anybody know why this happened?

We are fully upgraded and there was no interruption to normal service, but
this seems strange.

Re: odd error message

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Hi Ted, I think the problem you are seeing is due to this issue:

	https://issues.apache.org/jira/browse/ZOOKEEPER-790

which is fixed in 3.3.2.

-Flavio

On Apr 20, 2010, at 11:14 PM, Ted Dunning wrote:

> We have just done an upgrade of ZK to 3.3.0.  Previous to this, ZK has been
> up for about a year with no problems.
>
> On two nodes, we killed the previous instance and started the 3.3.0
> instance.  The first node was a follower and the second a leader.
>
> All went according to plan and no clients seemed to notice anything.  The
> stat command showed connections moving around as expected and all other
> indicators were normal.
>
> When we did the third node, we saw this in the log:
>
> 2010-04-20 14:07:49,010 - FATAL [QuorumPeer:/0.0.0.0:2181:Follower@71] -
> Leader epoch 18 is less than our epoch 19
>
> The third node refused all connections.
>
> We brought down the third node, wiped away its snapshot, restarted and it
> joined without complaint.  Note that the third node
> was originally a follower and had never been a leader during the upgrade
> process.
>
> Does anybody know why this happened?
>
> We are fully upgraded and there was no interruption to normal service, but
> this seems strange.


Re: odd error message

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Ok, I think this is possible.
Here is what happens currently. This is a long-standing bug and
should be fixed in 3.4:

https://issues.apache.org/jira/browse/ZOOKEEPER-335

A newly elected leader currently doesn't log the new leader transaction to
its database.

In your case, the follower (the 3rd server) did log it but the leader never
did. So when you brought up the 3rd server, it had that transaction in its
log but the leader did not. In that case the 3rd server cried
foul and shut down.
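
To make the check concrete, here is a minimal sketch of the comparison behind that log message. It relies on the fact that a zxid is a 64-bit number with the epoch in the high 32 bits and a counter in the low 32 bits. The function names and the counter values are hypothetical, chosen only for illustration; this is not the actual Follower code.

```python
EPOCH_SHIFT = 32
COUNTER_MASK = (1 << EPOCH_SHIFT) - 1


def make_zxid(epoch, counter):
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into a zxid."""
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)


def epoch_of(zxid):
    """Extract the epoch from the high 32 bits of a zxid."""
    return zxid >> EPOCH_SHIFT


def check_leader_epoch(leader_zxid, last_logged_zxid):
    """Refuse to follow a leader whose epoch is older than our own log's."""
    leader_epoch = epoch_of(leader_zxid)
    our_epoch = epoch_of(last_logged_zxid)
    if leader_epoch < our_epoch:
        raise RuntimeError(
            f"Leader epoch {leader_epoch} is less than our epoch {our_epoch}"
        )


# Illustrative values: the follower logged the epoch-19 transaction,
# while the leader's log still ends in epoch 18.
follower_last_logged = make_zxid(19, 0)
leader_last_logged = make_zxid(18, 500)

try:
    check_leader_epoch(leader_last_logged, follower_last_logged)
except RuntimeError as err:
    print(err)
```

Only the epochs are compared, so the leader's larger counter in epoch 18 does not help. Wiping the follower's DB makes it rejoin with nothing logged, so the check passes.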

Removing the DB is totally fine. For now, we should update our docs on 3.3
to mention that this problem might occur during an upgrade, and fix it in 3.4.


Thanks for bringing it up Ted.


Thanks
mahadev

On 4/20/10 2:14 PM, "Ted Dunning" <te...@gmail.com> wrote:

> We have just done an upgrade of ZK to 3.3.0.  Previous to this, ZK has been
> up for about a year with no problems.
> 
> On two nodes, we killed the previous instance and started the 3.3.0
> instance.  The first node was a follower and the second a leader.
> 
> All went according to plan and no clients seemed to notice anything.  The
> stat command showed connections moving around as expected and all other
> indicators were normal.
> 
> When we did the third node, we saw this in the log:
> 
> 2010-04-20 14:07:49,010 - FATAL [QuorumPeer:/0.0.0.0:2181:Follower@71] -
> Leader epoch 18 is less than our epoch 19
> 
> The third node refused all connections.
> 
> We brought down the third node, wiped away its snapshot, restarted and it
> joined without complaint.  Note that the third node
> was originally a follower and had never been a leader during the upgrade
> process.
> 
> Does anybody know why this happened?
> 
> We are fully upgraded and there was no interruption to normal service, but
> this seems strange.