Posted to dev@zookeeper.apache.org by Vishal Kher <vi...@gmail.com> on 2010/11/10 20:26:15 UTC

What happens to a follower if leader hangs?

Hi,

In Follower.followLeader() after syncing with the leader, the follower does:
                while (self.isRunning()) {
                    readPacket(qp);
                    processPacket(qp);
                }

It looks like it relies on socket timeout expiry to figure out if the
connection with the leader has gone down.  So a follower *with no clients*
may never notice a faulty leader if the leader has a software hang but the
TCP connections with the peers are still valid. Since it has no clients, it
won't heartbeat with the leader. If a majority of followers are not connected
to any clients, then even if other followers attempt to elect a new leader
after detecting that the leader is unresponsive, the election would presumably
not succeed, since a majority would still be following the hung leader.

Please correct me if I am wrong. If I am not mistaken, should we add code at
the follower to monitor the heartbeat messages that it receives from the
leader and take action if it misses heartbeats for longer than (syncLimit *
tickTime)? This is certainly a hypothetical case; however, I think it is
worth a fix.

Thanks.
-Vishal
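
The check proposed above could be sketched roughly as follows. This is only an
illustration, not ZooKeeper code: the class LeaderHeartbeatMonitor and its
methods are hypothetical names, and the clock is passed in so the logic can be
exercised without waiting for real time to pass.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of ZooKeeper): remembers when the last
 * packet arrived from the leader and reports when the silence has lasted
 * longer than syncLimit * tickTime.
 */
final class LeaderHeartbeatMonitor {
    private final long timeoutMs;
    private final LongSupplier clock;  // millisecond clock, injectable for tests
    private long lastHeardMs;

    LeaderHeartbeatMonitor(int tickTimeMs, int syncLimit, LongSupplier clock) {
        this.timeoutMs = (long) tickTimeMs * syncLimit;
        this.clock = clock;
        this.lastHeardMs = clock.getAsLong();
    }

    /** Call whenever a ping (or any other packet) arrives from the leader. */
    synchronized void recordPacket() {
        lastHeardMs = clock.getAsLong();
    }

    /** True once the leader has been silent for more than the timeout. */
    synchronized boolean leaderTimedOut() {
        return clock.getAsLong() - lastHeardMs > timeoutMs;
    }
}
```

A follower would call recordPacket() after each successful read and check
leaderTimedOut() from somewhere that cannot itself be stuck in a blocking
read, for example a separate thread; a real clock could be supplied as
() -> System.nanoTime() / 1_000_000.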

Re: What happens to a follower if leader hangs?

Posted by Benjamin Reed <br...@yahoo-inc.com>.
Have you been able to make this happen? The behavior you are suggesting
is exactly what should be happening. When we sync with the leader we set
the socket timeout: sock.setSoTimeout(self.tickTime * self.syncLimit);

If the leader hangs, we should get a timeout and disconnect from the leader.

ben
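
The socket behavior described above can be checked in isolation with a small
sketch like the one below. It is not ZooKeeper code: SilentPeerDemo and
readTimesOut are made-up names, and the "hung leader" is simulated by a local
server socket that accepts the connection but never writes anything. It only
shows that a blocking read on a socket with SO_TIMEOUT set ends with
SocketTimeoutException while the TCP connection itself stays open.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SilentPeerDemo {
    /**
     * Connects to a local server that accepts the connection but never
     * sends data, then performs one blocking read with the given timeout.
     * Returns true if the read ended with SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMs) {
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                 new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {  // held open, never written to
            client.setSoTimeout(timeoutMs);
            client.getInputStream().read();        // blocks: the peer is silent
            return false;                          // data or EOF arrived instead
        } catch (SocketTimeoutException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

In the follower the timeout would be tickTime * syncLimit rather than the
200 ms used here to keep the demo short.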


Re: What happens to a follower if leader hangs?

Posted by Vishal Kher <vi...@gmail.com>.
Yes, that's what I was planning to do. At the follower, start FLE (fast
leader election) if the follower does not receive a ping for longer than
(syncLimit * tickTime).



Re: What happens to a follower if leader hangs?

Posted by Patrick Hunt <ph...@apache.org>.
I'd go 3.3.3 and 3.4.0. Does any of this (incl. the other issues
Vishal/others have been finding recently) point to some particular set
of testing we might add to find problems like this? What are we
missing?

Once 3.3.2 is out and the immediate TLP issues are addressed I'm going to
start pushing for 3.4 regardless of whether "everything" is in yet or
not.

Patrick


Re: What happens to a follower if leader hangs?

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Vishal,
 There are periodic pings sent from the leader to the followers.

Take a look at Leader.java:

    syncedSet.add(self.getId());
    synchronized (learners) {
        for (LearnerHandler f : learners) {
            if (f.synced()) {
                syncedCount++;
                syncedSet.add(f.getSid());
            }
            f.ping();
        }
    }


This code sends periodic pings to the followers to make sure they are
running fine. On the follower side, we should keep track of these pings and
give up following the leader if we haven't seen a ping packet from it for a
long time. This is definitely worth fixing since we pride ourselves on being
a highly available and reliable service.

Please feel free to open a JIRA and work on it.
3.4 would be a good target for this.

Thanks
mahadev
