You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@zookeeper.apache.org by Ken Mamitsuka <ke...@airtime.com> on 2016/01/25 20:30:03 UTC

Zookeeper leader is terminated in AWS and cluster never recovers

This may be related to the post "All nodes shutting down simultaneously",
but I'm not sure. I was able to reproduce this quite a few times. I
terminated the leader of a 5 node zookeeper cluster (terminated from the
AWS console). All the remaining hosts saw the leader disappear:

Jan 20 18:40:05 localhost zookeeper: WARN
[QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out

Then it looks like a shutdown of the ZooKeeperServer and  its processors:

Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting
down

I’m not sure if we should see startup logs or not. Then we see the leader
election appear to be successful (I’m not sure if the “ZooKeeperServer not
running” log is ominous or not):

Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
/zookeeper/data/version-2/snapshot.8005dc366
Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id =  2, proposed zxid=0x8005e12e8
Jan 20 18:40:05 localhost zookeeper: INFO
 [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
...
Jan 20 18:40:06 localhost zookeeper: INFO
 [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created
server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER
ELECTION TOOK - 347
Jan 20 18:40:06 localhost zookeeper: INFO
 [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] -
Resolved hostname: zookeeper-5-internal.prod.airtime.com to address:
zookeeper-5-internal.prod.airtime.com/10.1.24.208

However, attempts to connect to the new leader just fail forever:

Jan 20 18:40:06 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
exception, tries=0, connecting to
zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
Jan 20 18:40:07 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
exception, tries=1, connecting to
zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
Jan 20 18:40:08 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
exception, tries=2, connecting to
zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
Jan 20 18:40:09 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
exception, tries=3, connecting to
zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888

Then it throws an exception and shuts down:

Jan 20 18:40:10 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
Jan 20 18:40:10 localhost     at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
Jan 20 18:40:10 localhost     at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
Shutting down
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id =  2, proposed zxid=0x8005e12e8
Jan 20 18:40:10 localhost zookeeper: INFO
 [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)

And then the fast leader election fails essentially forever:

Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 400
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 800
Jan 20 18:40:11 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 1600
Jan 20 18:40:13 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 3200
Jan 20 18:40:16 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 6400
Jan 20 18:40:22 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 12800
Jan 20 18:40:35 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 25600
Jan 20 18:41:01 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 51200
Jan 20 18:41:52 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 60000

Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com),
it doesn’t seem like it ever recognizes that it was the leader of the
original election. I’m not sure if it’s because it can’t start or something
else.

Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id =  5, proposed zxid=0x8005e12e8
Jan 20 18:40:05 localhost zookeeper: INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 400
Jan 20 18:40:06 localhost zookeeper: INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 800
Jan 20 18:40:06 localhost zookeeper: INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
Jan 20 18:40:07 localhost zookeeper: INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 1600

A couple of additional notes:

   1. Restarting zookeeper on all hosts seemed to resolve the issue.
   2. This was easily reproducible for me in 3 different zookeeper clusters
   in AWS on Jan 20/21. Other things like killing follower hosts worked fine.
   3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if
   anything network related could make this situation fail, while everything
   else worked fine.

Re: Zookeeper leader is terminated in AWS and cluster never recovers

Posted by Ken Mamitsuka <ke...@airtime.com>.

Thanks Raul. I haven't been able to repro the issue recently, but I'll get
that info when/if I am able to do so.

On Wed, Jan 27, 2016 at 5:39 PM, Raúl Gutiérrez Segalés <rgs@itevenworks.net
> wrote:

> Can you see traffic on the election and ZAB ports?
>
> You can sniff these messages via https://github.com/twitter/zktraffic.
>
> Maybe jstack the processes and post that as well?
>
> -rgs
> On Jan 27, 2016 10:45 AM, "Ken Mamitsuka" <ke...@airtime.com> wrote:
>
> > Thanks Flavio, I uploaded the log files to:
> >
> >
> >
> https://drive.google.com/folderview?id=0B_aqYHmVv9E-LUZJYnlJRDYyNjg&usp=sharing
> >
> > log_follower - log from one of the followers
> > log_new_leader - log from the elected leader (that everyone fails to
> > connect to)
> >
> > The issue is triggered when we killed the original leader at:
> >
> > Jan 20 18:40:05 localhost zookeeper: WARN
> >  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > following the leader
> >
> > and recovered after i restarted all the zookeepers (monit restart
> > zookeeper-server which just wraps zkServer.sh):
> >
> > Jan 20 18:52:50 zookeeper-2 ansible-service: Invoked with
> > name=zookeeper-server pattern=None enabled=None state=restarted
> sleep=None
> > arguments= runlevel=default
> >
> > I don't believe this is related to ZOOKEEPER-1506 because nothing in DNS
> > changed, we just killed the leader (and the resolved ip address for
> > zookeeper-5.eng.airtime.com is valid).
> >
> > Thanks again for looking at this.
> >
> > On Wed, Jan 27, 2016 at 7:52 AM, Flavio Junqueira <fp...@apache.org>
> wrote:
> >
> > > Perhaps you want to upload your logs somewhere for us to take a look.
> > > There has been issues around name resolution, like this one:
> > >
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-1506 <
> > > https://issues.apache.org/jira/browse/ZOOKEEPER-1506>
> > >
> > > Is there any chance it could be related to it?
> > >
> > > -Flavio
> > >
> > >
> > > > On 26 Jan 2016, at 22:19, Jim Wong <ji...@airtime.com> wrote:
> > > >
> > > > Folks,
> > > >
> > > > Does anyone have any thoughts on this one? We can semi-reliably cause
> > > our cluster to get into a seemingly permanent bad stat--at least
> several
> > > minutes, which is longer than we can tolerate in production--just by
> > > terminating the leader. We’ve seen recommendations to change things
> like
> > > the amount of time allotted for the database to be reconstructed from
> the
> > > last snapshot, but none of them have helped.
> > > >
> > > > If nothing else, are there any particular log messages we should be
> > > looking for or troubleshooting strategies we can pursue?
> > > >
> > > >
> > > >
> > > >> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <ke...@airtime.com>
> wrote:
> > > >>
> > > >> This may be related to the post "All nodes shutting down
> > > simultaneously",
> > > >> but I'm not sure. I was able to reproduce this quite a few times. I
> > > >> terminated the leader of a 5 node zookeeper cluster (terminated from
> > the
> > > >> AWS console). All the remaining hosts saw the leader disappear:
> > > >>
> > > >> Jan 20 18:40:05 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception
> > when
> > > >> following the leader
> > > >> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read
> timed
> > > out
> > > >>
> > > >> Then it looks like a shutdown of the ZooKeeperServer and  its
> > > processors:
> > > >>
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] -
> > > shutting
> > > >> down
> > > >>
> > > >> I’m not sure if we should see startup logs or not. Then we see the
> > > leader
> > > >> election appear to be successful (I’m not sure if the
> “ZooKeeperServer
> > > not
> > > >> running” log is ominous or not):
> > > >>
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading
> > > snapshot
> > > >> /zookeeper/data/version-2/snapshot.8005dc366
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> > New
> > > >> election. My id =  2, proposed zxid=0x8005e12e8
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> ...
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] -
> FOLLOWING
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] -
> > Created
> > > >> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout
> > 40000
> > > >> datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING -
> > > LEADER
> > > >> ELECTION TOOK - 347
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149
> ]
> > -
> > > >> Resolved hostname: zookeeper-5-internal.prod.airtime.com to
> address:
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208
> > > >>
> > > >> However, attempts to connect to the new leader just fail forever:
> > > >>
> > > >> Jan 20 18:40:06 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > > >> exception, tries=0, connecting to
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > >> Jan 20 18:40:07 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > > >> exception, tries=1, connecting to
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > >> Jan 20 18:40:08 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > > >> exception, tries=2, connecting to
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > >> Jan 20 18:40:09 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > > >> exception, tries=3, connecting to
> > > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > > >>
> > > >> Then it throws an exception and shuts down:
> > > >>
> > > >> Jan 20 18:40:10 localhost zookeeper: WARN
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception
> > when
> > > >> following the leader
> > > >> Jan 20 18:40:10 localhost java.net.ConnectException: Connection
> > refused
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> > > called
> > > >> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
> > > >> Jan 20 18:40:10 localhost     at
> > > >>
> > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> > > >> Jan 20 18:40:10 localhost     at
> > > >>
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140
> ]
> > -
> > > >> Shutting down
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> > New
> > > >> election. My id =  2, proposed zxid=0x8005e12e8
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
> > > >> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >>
> > > >> And then the fast leader election fails essentially forever:
> > > >>
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 400
> > > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 800
> > > >> Jan 20 18:40:11 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 1600
> > > >> Jan 20 18:40:13 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 3200
> > > >> Jan 20 18:40:16 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 6400
> > > >> Jan 20 18:40:22 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 12800
> > > >> Jan 20 18:40:35 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 25600
> > > >> Jan 20 18:41:01 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 51200
> > > >> Jan 20 18:41:52 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 60000
> > > >>
> > > >> Looking at the node elected as the new leader (
> > > zookeeper-5.prod.airtime.com),
> > > >> it doesn’t seem like it ever recognizes that it was the leader of
> the
> > > >> original election. I’m not sure if it’s because it can’t start or
> > > something
> > > >> else.
> > > >>
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> > New
> > > >> election. My id =  5, proposed zxid=0x8005e12e8
> > > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 400
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 800
> > > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > > (message
> > > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > > >> Jan 20 18:40:07 localhost zookeeper: INFO
> > > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > > >> Notification time out: 1600
> > > >>
> > > >> A couple of additional notes:
> > > >>
> > > >>  1. Restarting zookeeper on all hosts seemed to resolve the issue.
> > > >>  2. This was easily reproducible for me in 3 different zookeeper
> > > clusters
> > > >>  in AWS on Jan 20/21. Other things like killing follower hosts
> worked
> > > fine.
> > > >>  3. I wasn't able to reproduce this yesterday (Jan 24th), so I
> wonder
> > if
> > > >>  anything network related could make this situation fail, while
> > > everything
> > > >>  else worked fine.
> > > >
> > >
> > >
> >
>

Re: Zookeeper leader is terminated in AWS and cluster never recovers

Posted by Raúl Gutiérrez Segalés <rg...@itevenworks.net>.

Can you see traffic on the election and ZAB ports?

You can sniff these messages via https://github.com/twitter/zktraffic.

Maybe jstack the processes and post that as well?

-rgs
On Jan 27, 2016 10:45 AM, "Ken Mamitsuka" <ke...@airtime.com> wrote:

> Thanks Flavio, I uploaded the log files to:
>
>
> https://drive.google.com/folderview?id=0B_aqYHmVv9E-LUZJYnlJRDYyNjg&usp=sharing
>
> log_follower - log from one of the followers
> log_new_leader - log from the elected leader (that everyone fails to
> connect to)
>
> The issue is triggered when we killed the original leader at:
>
> Jan 20 18:40:05 localhost zookeeper: WARN
>  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
>
> and recovered after i restarted all the zookeepers (monit restart
> zookeeper-server which just wraps zkServer.sh):
>
> Jan 20 18:52:50 zookeeper-2 ansible-service: Invoked with
> name=zookeeper-server pattern=None enabled=None state=restarted sleep=None
> arguments= runlevel=default
>
> I don't believe this is related to ZOOKEEPER-1506 because nothing in DNS
> changed, we just killed the leader (and the resolved ip address for
> zookeeper-5.eng.airtime.com is valid).
>
> Thanks again for looking at this.
>
> On Wed, Jan 27, 2016 at 7:52 AM, Flavio Junqueira <fp...@apache.org> wrote:
>
> > Perhaps you want to upload your logs somewhere for us to take a look.
> > There has been issues around name resolution, like this one:
> >
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1506 <
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1506>
> >
> > Is there any chance it could be related to it?
> >
> > -Flavio
> >
> >
> > > On 26 Jan 2016, at 22:19, Jim Wong <ji...@airtime.com> wrote:
> > >
> > > Folks,
> > >
> > > Does anyone have any thoughts on this one? We can semi-reliably cause
> > our cluster to get into a seemingly permanent bad stat--at least several
> > minutes, which is longer than we can tolerate in production--just by
> > terminating the leader. We’ve seen recommendations to change things like
> > the amount of time allotted for the database to be reconstructed from the
> > last snapshot, but none of them have helped.
> > >
> > > If nothing else, are there any particular log messages we should be
> > looking for or troubleshooting strategies we can pursue?
> > >
> > >
> > >
> > >> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <ke...@airtime.com> wrote:
> > >>
> > >> This may be related to the post "All nodes shutting down
> > simultaneously",
> > >> but I'm not sure. I was able to reproduce this quite a few times. I
> > >> terminated the leader of a 5 node zookeeper cluster (terminated from
> the
> > >> AWS console). All the remaining hosts saw the leader disappear:
> > >>
> > >> Jan 20 18:40:05 localhost zookeeper: WARN
> > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception
> when
> > >> following the leader
> > >> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed
> > out
> > >>
> > >> Then it looks like a shutdown of the ZooKeeperServer and  its
> > processors:
> > >>
> > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] -
> > shutting
> > >> down
> > >>
> > >> I’m not sure if we should see startup logs or not. Then we see the
> > leader
> > >> election appear to be successful (I’m not sure if the “ZooKeeperServer
> > not
> > >> running” log is ominous or not):
> > >>
> > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading
> > snapshot
> > >> /zookeeper/data/version-2/snapshot.8005dc366
> > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> New
> > >> election. My id =  2, proposed zxid=0x8005e12e8
> > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > (message
> > >> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > >> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > >> ...
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > (message
> > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] -
> Created
> > >> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout
> 40000
> > >> datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING -
> > LEADER
> > >> ELECTION TOOK - 347
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > (message
> > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > >> LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149]
> -
> > >> Resolved hostname: zookeeper-5-internal.prod.airtime.com to address:
> > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208
> > >>
> > >> However, attempts to connect to the new leader just fail forever:
> > >>
> > >> Jan 20 18:40:06 localhost zookeeper: WARN
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > >> exception, tries=0, connecting to
> > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > >> Jan 20 18:40:07 localhost zookeeper: WARN
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > >> exception, tries=1, connecting to
> > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > >> Jan 20 18:40:08 localhost zookeeper: WARN
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > >> exception, tries=2, connecting to
> > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > >> Jan 20 18:40:09 localhost zookeeper: WARN
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> > >> exception, tries=3, connecting to
> > >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> > >>
> > >> Then it throws an exception and shuts down:
> > >>
> > >> Jan 20 18:40:10 localhost zookeeper: WARN
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception
> when
> > >> following the leader
> > >> Jan 20 18:40:10 localhost java.net.ConnectException: Connection
> refused
> > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> > called
> > >> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
> > >> Jan 20 18:40:10 localhost     at
> > >>
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> > >> Jan 20 18:40:10 localhost     at
> > >> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
> > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140]
> -
> > >> Shutting down
> > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> New
> > >> election. My id =  2, proposed zxid=0x8005e12e8
> > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> > (message
> > >> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
> > >> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > >>
> > >> And then the fast leader election fails essentially forever:
> > >>
> > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 400
> > >> Jan 20 18:40:10 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 800
> > >> Jan 20 18:40:11 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 1600
> > >> Jan 20 18:40:13 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 3200
> > >> Jan 20 18:40:16 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 6400
> > >> Jan 20 18:40:22 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 12800
> > >> Jan 20 18:40:35 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 25600
> > >> Jan 20 18:41:01 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 51200
> > >> Jan 20 18:41:52 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 60000
> > >>
> > >> Looking at the node elected as the new leader (
> > zookeeper-5.prod.airtime.com),
> > >> it doesn’t seem like it ever recognizes that it was the leader of the
> > >> original election. I’m not sure if it’s because it can’t start or
> > something
> > >> else.
> > >>
> > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] -
> New
> > >> election. My id =  5, proposed zxid=0x8005e12e8
> > >> Jan 20 18:40:05 localhost zookeeper: INFO
> > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 400
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 800
> > >> Jan 20 18:40:06 localhost zookeeper: INFO
> > >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> > (message
> > >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> > >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> > >> Jan 20 18:40:07 localhost zookeeper: INFO
> > >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> > >> Notification time out: 1600
> > >>
> > >> A couple of additional notes:
> > >>
> > >>  1. Restarting zookeeper on all hosts seemed to resolve the issue.
> > >>  2. This was easily reproducible for me in 3 different zookeeper
> > clusters
> > >>  in AWS on Jan 20/21. Other things like killing follower hosts worked
> > fine.
> > >>  3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder
> if
> > >>  anything network related could make this situation fail, while
> > everything
> > >>  else worked fine.
> > >
> >
> >
>

Re: Zookeeper leader is terminated in AWS and cluster never recovers

Posted by Ken Mamitsuka <ke...@airtime.com>.

Thanks Flavio, I uploaded the log files to:

https://drive.google.com/folderview?id=0B_aqYHmVv9E-LUZJYnlJRDYyNjg&usp=sharing

log_follower - log from one of the followers
log_new_leader - log from the elected leader (that everyone fails to
connect to)

The issue is triggered when we killed the original leader at:

Jan 20 18:40:05 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader

and recovered after i restarted all the zookeepers (monit restart
zookeeper-server which just wraps zkServer.sh):

Jan 20 18:52:50 zookeeper-2 ansible-service: Invoked with
name=zookeeper-server pattern=None enabled=None state=restarted sleep=None
arguments= runlevel=default

I don't believe this is related to ZOOKEEPER-1506 because nothing in DNS
changed, we just killed the leader (and the resolved ip address for
zookeeper-5.eng.airtime.com is valid).

Thanks again for looking at this.

On Wed, Jan 27, 2016 at 7:52 AM, Flavio Junqueira <fp...@apache.org> wrote:

> Perhaps you want to upload your logs somewhere for us to take a look.
> There has been issues around name resolution, like this one:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-1506 <
> https://issues.apache.org/jira/browse/ZOOKEEPER-1506>
>
> Is there any chance it could be related to it?
>
> -Flavio
>
>
> > On 26 Jan 2016, at 22:19, Jim Wong <ji...@airtime.com> wrote:
> >
> > Folks,
> >
> > Does anyone have any thoughts on this one? We can semi-reliably cause
> our cluster to get into a seemingly permanent bad stat--at least several
> minutes, which is longer than we can tolerate in production--just by
> terminating the leader. We’ve seen recommendations to change things like
> the amount of time allotted for the database to be reconstructed from the
> last snapshot, but none of them have helped.
> >
> > If nothing else, are there any particular log messages we should be
> looking for or troubleshooting strategies we can pursue?
> >
> >
> >
> >> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <ke...@airtime.com> wrote:
> >>
> >> This may be related to the post "All nodes shutting down
> simultaneously",
> >> but I'm not sure. I was able to reproduce this quite a few times. I
> >> terminated the leader of a 5 node zookeeper cluster (terminated from the
> >> AWS console). All the remaining hosts saw the leader disappear:
> >>
> >> Jan 20 18:40:05 localhost zookeeper: WARN
> >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> >> following the leader
> >> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed
> out
> >>
> >> Then it looks like a shutdown of the ZooKeeperServer and  its
> processors:
> >>
> >> Jan 20 18:40:05 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] -
> shutting
> >> down
> >>
> >> I’m not sure if we should see startup logs or not. Then we see the
> leader
> >> election appear to be successful (I’m not sure if the “ZooKeeperServer
> not
> >> running” log is ominous or not):
> >>
> >> Jan 20 18:40:05 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> >> Jan 20 18:40:05 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading
> snapshot
> >> /zookeeper/data/version-2/snapshot.8005dc366
> >> Jan 20 18:40:05 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> >> election. My id =  2, proposed zxid=0x8005e12e8
> >> Jan 20 18:40:05 localhost zookeeper: INFO
> >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> (message
> >> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> >> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> >> ...
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> (message
> >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created
> >> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
> >> datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING -
> LEADER
> >> ELECTION TOOK - 347
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> (message
> >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> >> LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] -
> >> Resolved hostname: zookeeper-5-internal.prod.airtime.com to address:
> >> zookeeper-5-internal.prod.airtime.com/10.1.24.208
> >>
> >> However, attempts to connect to the new leader just fail forever:
> >>
> >> Jan 20 18:40:06 localhost zookeeper: WARN
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> >> exception, tries=0, connecting to
> >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> >> Jan 20 18:40:07 localhost zookeeper: WARN
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> >> exception, tries=1, connecting to
> >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> >> Jan 20 18:40:08 localhost zookeeper: WARN
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> >> exception, tries=2, connecting to
> >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> >> Jan 20 18:40:09 localhost zookeeper: WARN
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> >> exception, tries=3, connecting to
> >> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> >>
> >> Then it throws an exception and shuts down:
> >>
> >> Jan 20 18:40:10 localhost zookeeper: WARN
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> >> following the leader
> >> Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
> >> Jan 20 18:40:10 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> called
> >> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
> >> Jan 20 18:40:10 localhost     at
> >> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> >> Jan 20 18:40:10 localhost     at
> >> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
> >> Jan 20 18:40:10 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
> >> Shutting down
> >> Jan 20 18:40:10 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> >> Jan 20 18:40:10 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> >> election. My id =  2, proposed zxid=0x8005e12e8
> >> Jan 20 18:40:10 localhost zookeeper: INFO
> >> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1
> (message
> >> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
> >> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> >>
> >> And then the fast leader election fails essentially forever:
> >>
> >> Jan 20 18:40:10 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 400
> >> Jan 20 18:40:10 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 800
> >> Jan 20 18:40:11 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 1600
> >> Jan 20 18:40:13 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 3200
> >> Jan 20 18:40:16 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 6400
> >> Jan 20 18:40:22 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 12800
> >> Jan 20 18:40:35 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 25600
> >> Jan 20 18:41:01 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 51200
> >> Jan 20 18:41:52 localhost zookeeper: INFO
> >> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 60000
> >>
> >> Looking at the node elected as the new leader (
> zookeeper-5.prod.airtime.com),
> >> it doesn’t seem like it ever recognizes that it was the leader of the
> >> original election. I’m not sure if it’s because it can’t start or
> something
> >> else.
> >>
> >> Jan 20 18:40:05 localhost zookeeper: INFO
> >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> >> election. My id =  5, proposed zxid=0x8005e12e8
> >> Jan 20 18:40:05 localhost zookeeper: INFO
> >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 400
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 800
> >> Jan 20 18:40:06 localhost zookeeper: INFO
> >> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1
> (message
> >> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> >> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> >> Jan 20 18:40:07 localhost zookeeper: INFO
> >> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> >> Notification time out: 1600
> >>
> >> A couple of additional notes:
> >>
> >>  1. Restarting zookeeper on all hosts seemed to resolve the issue.
> >>  2. This was easily reproducible for me in 3 different zookeeper
> clusters
> >>  in AWS on Jan 20/21. Other things like killing follower hosts worked
> fine.
> >>  3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if
> >>  anything network related could make this situation fail, while
> everything
> >>  else worked fine.
> >
>
>

Re: Zookeeper leader is terminated in AWS and cluster never recovers

Posted by Flavio Junqueira <fp...@apache.org>.

Perhaps you want to upload your logs somewhere for us to take a look. There has been issues around name resolution, like this one:

https://issues.apache.org/jira/browse/ZOOKEEPER-1506 <https://issues.apache.org/jira/browse/ZOOKEEPER-1506>

Is there any chance it could be related to it?

-Flavio 


> On 26 Jan 2016, at 22:19, Jim Wong <ji...@airtime.com> wrote:
> 
> Folks,
> 
> Does anyone have any thoughts on this one? We can semi-reliably cause our cluster to get into a seemingly permanent bad stat--at least several minutes, which is longer than we can tolerate in production--just by terminating the leader. We’ve seen recommendations to change things like the amount of time allotted for the database to be reconstructed from the last snapshot, but none of them have helped.
> 
> If nothing else, are there any particular log messages we should be looking for or troubleshooting strategies we can pursue?
> 
> 
> 
>> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <ke...@airtime.com> wrote:
>> 
>> This may be related to the post "All nodes shutting down simultaneously",
>> but I'm not sure. I was able to reproduce this quite a few times. I
>> terminated the leader of a 5 node zookeeper cluster (terminated from the
>> AWS console). All the remaining hosts saw the leader disappear:
>> 
>> Jan 20 18:40:05 localhost zookeeper: WARN
>> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
>> following the leader
>> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out
>> 
>> Then it looks like a shutdown of the ZooKeeperServer and  its processors:
>> 
>> Jan 20 18:40:05 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting
>> down
>> 
>> I’m not sure if we should see startup logs or not. Then we see the leader
>> election appear to be successful (I’m not sure if the “ZooKeeperServer not
>> running” log is ominous or not):
>> 
>> Jan 20 18:40:05 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
>> Jan 20 18:40:05 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
>> /zookeeper/data/version-2/snapshot.8005dc366
>> Jan 20 18:40:05 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
>> election. My id =  2, proposed zxid=0x8005e12e8
>> Jan 20 18:40:05 localhost zookeeper: INFO
>> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
>> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
>> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> ...
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
>> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
>> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created
>> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
>> datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER
>> ELECTION TOOK - 347
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
>> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
>> LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] -
>> Resolved hostname: zookeeper-5-internal.prod.airtime.com to address:
>> zookeeper-5-internal.prod.airtime.com/10.1.24.208
>> 
>> However, attempts to connect to the new leader just fail forever:
>> 
>> Jan 20 18:40:06 localhost zookeeper: WARN
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
>> exception, tries=0, connecting to
>> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:07 localhost zookeeper: WARN
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
>> exception, tries=1, connecting to
>> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:08 localhost zookeeper: WARN
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
>> exception, tries=2, connecting to
>> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> Jan 20 18:40:09 localhost zookeeper: WARN
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
>> exception, tries=3, connecting to
>> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
>> 
>> Then it throws an exception and shuts down:
>> 
>> Jan 20 18:40:10 localhost zookeeper: WARN
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
>> following the leader
>> Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
>> Jan 20 18:40:10 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
>> Jan 20 18:40:10 localhost     at
>> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>> Jan 20 18:40:10 localhost     at
>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
>> Jan 20 18:40:10 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
>> Shutting down
>> Jan 20 18:40:10 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
>> Jan 20 18:40:10 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
>> election. My id =  2, proposed zxid=0x8005e12e8
>> Jan 20 18:40:10 localhost zookeeper: INFO
>> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
>> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
>> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> 
>> And then the fast leader election fails essentially forever:
>> 
>> Jan 20 18:40:10 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 400
>> Jan 20 18:40:10 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 800
>> Jan 20 18:40:11 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 1600
>> Jan 20 18:40:13 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 3200
>> Jan 20 18:40:16 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 6400
>> Jan 20 18:40:22 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 12800
>> Jan 20 18:40:35 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 25600
>> Jan 20 18:41:01 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 51200
>> Jan 20 18:41:52 localhost zookeeper: INFO
>> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 60000
>> 
>> Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com),
>> it doesn’t seem like it ever recognizes that it was the leader of the
>> original election. I’m not sure if it’s because it can’t start or something
>> else.
>> 
>> Jan 20 18:40:05 localhost zookeeper: INFO
>> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
>> election. My id =  5, proposed zxid=0x8005e12e8
>> Jan 20 18:40:05 localhost zookeeper: INFO
>> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
>> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
>> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 400
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
>> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
>> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 800
>> Jan 20 18:40:06 localhost zookeeper: INFO
>> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
>> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
>> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
>> Jan 20 18:40:07 localhost zookeeper: INFO
>> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
>> Notification time out: 1600
>> 
>> A couple of additional notes:
>> 
>>  1. Restarting zookeeper on all hosts seemed to resolve the issue.
>>  2. This was easily reproducible for me in 3 different zookeeper clusters
>>  in AWS on Jan 20/21. Other things like killing follower hosts worked fine.
>>  3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if
>>  anything network related could make this situation fail, while everything
>>  else worked fine.
>

Re: Zookeeper leader is terminated in AWS and cluster never recovers

Posted by Jim Wong <ji...@airtime.com>.

Folks,

Does anyone have any thoughts on this one? We can semi-reliably cause our cluster to get into a seemingly permanent bad stat--at least several minutes, which is longer than we can tolerate in production--just by terminating the leader. We’ve seen recommendations to change things like the amount of time allotted for the database to be reconstructed from the last snapshot, but none of them have helped.

If nothing else, are there any particular log messages we should be looking for or troubleshooting strategies we can pursue?



> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <ke...@airtime.com> wrote:
> 
> This may be related to the post "All nodes shutting down simultaneously",
> but I'm not sure. I was able to reproduce this quite a few times. I
> terminated the leader of a 5 node zookeeper cluster (terminated from the
> AWS console). All the remaining hosts saw the leader disappear:
> 
> Jan 20 18:40:05 localhost zookeeper: WARN
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out
> 
> Then it looks like a shutdown of the ZooKeeperServer and  its processors:
> 
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting
> down
> 
> I’m not sure if we should see startup logs or not. Then we see the leader
> election appear to be successful (I’m not sure if the “ZooKeeperServer not
> running” log is ominous or not):
> 
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
> /zookeeper/data/version-2/snapshot.8005dc366
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  2, proposed zxid=0x8005e12e8
> Jan 20 18:40:05 localhost zookeeper: INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> ...
> Jan 20 18:40:06 localhost zookeeper: INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created
> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
> datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER
> ELECTION TOOK - 347
> Jan 20 18:40:06 localhost zookeeper: INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] -
> Resolved hostname: zookeeper-5-internal.prod.airtime.com to address:
> zookeeper-5-internal.prod.airtime.com/10.1.24.208
> 
> However, attempts to connect to the new leader just fail forever:
> 
> Jan 20 18:40:06 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> exception, tries=0, connecting to
> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> Jan 20 18:40:07 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> exception, tries=1, connecting to
> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> Jan 20 18:40:08 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> exception, tries=2, connecting to
> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> Jan 20 18:40:09 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> exception, tries=3, connecting to
> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> 
> Then it throws an exception and shuts down:
> 
> Jan 20 18:40:10 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
> Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
> Jan 20 18:40:10 localhost     at
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> Jan 20 18:40:10 localhost     at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
> Shutting down
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  2, proposed zxid=0x8005e12e8
> Jan 20 18:40:10 localhost zookeeper: INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> 
> And then the fast leader election fails essentially forever:
> 
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 400
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 800
> Jan 20 18:40:11 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 1600
> Jan 20 18:40:13 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 3200
> Jan 20 18:40:16 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 6400
> Jan 20 18:40:22 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 12800
> Jan 20 18:40:35 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 25600
> Jan 20 18:41:01 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 51200
> Jan 20 18:41:52 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> 
> Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com),
> it doesn’t seem like it ever recognizes that it was the leader of the
> original election. I’m not sure if it’s because it can’t start or something
> else.
> 
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  5, proposed zxid=0x8005e12e8
> Jan 20 18:40:05 localhost zookeeper: INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 400
> Jan 20 18:40:06 localhost zookeeper: INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 800
> Jan 20 18:40:06 localhost zookeeper: INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> Jan 20 18:40:07 localhost zookeeper: INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 1600
> 
> A couple of additional notes:
> 
>   1. Restarting zookeeper on all hosts seemed to resolve the issue.
>   2. This was easily reproducible for me in 3 different zookeeper clusters
>   in AWS on Jan 20/21. Other things like killing follower hosts worked fine.
>   3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if
>   anything network related could make this situation fail, while everything
>   else worked fine.