Posted to user@zookeeper.apache.org by Cee Tee <c....@gmail.com> on 2018/10/02 08:28:08 UTC

Re: Leader election failing

Quick update:
Apparently the election notifications were being dropped somewhere between
the datacenters (by the firewall) once the sockets had been idle for some
time. We fixed this with zookeeper.tcpKeepAlive=true.
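
For anyone hitting the same thing: the setting is the zookeeper.tcpKeepAlive
Java system property. A rough sketch of how we pass it, assuming the stock
zkServer.sh/zkEnv.sh start scripts (adjust to however you set JVM flags):

   # conf/java.env, sourced by zkEnv.sh at startup
   SERVER_JVMFLAGS="-Dzookeeper.tcpKeepAlive=true"

With that set, SO_KEEPALIVE is enabled on the leader election sockets, so
idle connections still see periodic keepalive probes and the firewall stops
silently dropping them.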

Regards,

Chris

On Wed, Aug 8, 2018 at 5:05 PM Andor Molnar <an...@cloudera.com.invalid>
wrote:

> Some kind of a network split?
>
> It looks like 1-2 and 3-4 were able to communicate with each other, but
> connections timed out between these two groups. When 5 came back online it
> started with 1 and 2 as supporters, and later 3 and 4 also joined.
>
> There was no such issue the day after.
>
> Which version of ZooKeeper is this? 3.5.something?
>
> Regards,
> Andor
>
>
>
> On Wed, Aug 8, 2018 at 4:52 PM, Chris <c....@gmail.com> wrote:
>
> > Actually I have similar issues on my test and acceptance clusters, where
> > leader election fails if the cluster has been running for a couple of
> > days. If you stop/start the ZooKeepers once, they will work fine on
> > further disruptions that day. Not sure yet what the threshold is.
> >
> >
> > On 8 August 2018 4:32:56 pm Camille Fournier <ca...@apache.org> wrote:
> >
> >> Hard to say. It looks like about 15 minutes after your first incident,
> >> where 5 goes down and then comes back up, servers 1 and 2 get socket
> >> errors on their connections with 3, 4, and 6. It's possible that, had you
> >> waited those 15 minutes, the quorum would have formed with the other
> >> servers once those errors cleared. But as for why those errors occurred
> >> in the first place, it's not clear. Could be a network glitch, or an
> >> obscure bug in the connection logic. Has anyone else ever seen this?
> >> If you see it again, getting a stack trace of the servers when they can't
> >> form quorum might be helpful.
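> >> (A plain JVM thread dump should be enough for that, e.g. jstack <pid> or
> >> kill -3 <pid> against each stuck ZooKeeper process; kill -3 makes the JVM
> >> print the dump to its own stdout/log.)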
> >>
> >> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c....@gmail.com> wrote:
> >>
> >>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> >>> 1, 2, 5 are in datacenter A; 3, 4, 6 are in datacenter B.
> >>> Yesterday one of the participants (id 5, which by chance was the leader)
> >>> was rebooted. Although all the other servers were online and not
> >>> suffering from networking issues, the leader election failed and the
> >>> cluster remained "looking" until the old leader came back online, after
> >>> which it was promptly elected as leader again.
> >>>
> >>> Today we tried the same exercise on the exact same servers: 5 was still
> >>> the leader and was rebooted, and this time leader election worked fine,
> >>> with 4 as the new leader.
> >>>
> >>> I have included the logs. From the logs I see that yesterday 1 and 2
> >>> never received new leader proposals from 3 and 4, and vice versa.
> >>> Today all proposals came through. This is not the first time we've seen
> >>> this type of behavior, where some ZooKeepers can't seem to find each
> >>> other after the leader goes down.
> >>> All servers use dynamic configuration and have the same config node.
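> >>> For context, the entries in that dynamic config look roughly like the
> >>> following (one line per server; hostnames and ports are placeholders):
> >>>   server.1=zk-a1.example.com:2888:3888:participant;2181
> >>>   server.6=zk-b3.example.com:2888:3888:observer;2181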
> >>>
> >>> How could this be explained? These servers also host a replicated
> >>> database
> >>> cluster and have no history of db replication issues.
> >>>
> >>> Thanks,
> >>> Chris
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
>