
Posted to user@zookeeper.apache.org by Michi Mutsuzaki <mi...@cs.stanford.edu> on 2014/05/30 07:29:48 UTC

leader election doesn't settle

I have a 3 server cluster using ZooKeeper 3.4.5 with server IDs 61,
150, and 228, and 150 is the leader. I shut down 150. I have 2
questions.

1) Both 61 and 228 take about 5 minutes to detect that the leader
died. Is there a TCP setting I need to tune to make this quicker?

https://paste.apache.org/4AFR?action=download

2) Leader election between 61 and 228 never settles. 61 doesn't seem
to receive notifications from 228, and 228 keeps receiving notifications
from 61 for the previous epoch. I restarted 61 and the leader election
settled. Have you guys seen this behavior?

https://paste.apache.org/37vU?action=download

Re: leader election doesn't settle

Posted by Flavio Junqueira <fp...@yahoo.com>.

Please consider committing ZK-1810 as well! :-)

-Flavio

On 31 May 2014, at 04:29, Michi Mutsuzaki <mi...@cs.stanford.edu> wrote:

> Thanks Flavio. I guess it's time for me to upgrade to 3.4.6 :)
> 
> Regarding the socket timeout, I see that the timeout is set to
> self.tickTime * self.initLimit in connectToLeader(). I'm using the
> default values for both tickTime and initLimit, so it should have
> timed out sooner. I'll double check these settings and the time the
> leader got killed.
> 
> Thanks!
> --Michi
> 
> 
> On Fri, May 30, 2014 at 1:30 AM, FPJ <fp...@yahoo.com> wrote:
>> Hi Michi,
>> 
>> 1) The follower stops following the leader when it gets an exception on the
>> socket (Follower.followLeader):
>>             ...
>>             while (self.isRunning()) {
>>                 readPacket(qp);
>>                 processPacket(qp);
>>             }
>>         } catch (Exception e) {
>>             ...
>> 
>> I believe we are setting the timeout like this: self.tickTime *
>> self.initLimit. Check connectToLeader().
>> 
>> 2) I believe we fixed this bug in 3.4.6 and the change is pending for trunk.
>> Check ZK-1808 for 3.4.6 and ZK-1810 for trunk.
>> 
>> -Flavio
>> 
>> 
>>> -----Original Message-----
>>> From: mutsuzaki@gmail.com [mailto:mutsuzaki@gmail.com] On Behalf Of
>>> Michi Mutsuzaki
>>> Sent: 30 May 2014 06:30
>>> To: user@zookeeper.apache.org
>>> Subject: leader election doesn't settle
>>> 
>>> I have a 3 server cluster using ZooKeeper 3.4.5 with server IDs 61, 150,
>>> and 228, and 150 is the leader. I shut down 150. I have 2 questions.
>>> 
>>> 1) Both 61 and 228 take about 5 minutes to detect that the leader died.
>>> Is there a TCP setting I need to tune to make this quicker?
>>> 
>>> https://paste.apache.org/4AFR?action=download
>>> 
>>> 2) Leader election between 61 and 228 never settles. 61 doesn't seem to
>>> receive notifications from 228, and 228 keeps receiving notifications from
>>> 61 for the previous epoch. I restarted 61 and the leader election settled.
>>> Have you guys seen this behavior?
>>> 
>>> https://paste.apache.org/37vU?action=download
>>

Re: leader election doesn't settle

Posted by Michi Mutsuzaki <mi...@cs.stanford.edu>.

Thanks Flavio. I guess it's time for me to upgrade to 3.4.6 :)

Regarding the socket timeout, I see that the timeout is set to
self.tickTime * self.initLimit in connectToLeader(). I'm using the
default values for both tickTime and initLimit, so it should have
timed out sooner. I'll double check these settings and the time the
leader got killed.

Thanks!
--Michi
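
To put numbers on the timeout discussed above, here is a minimal sketch of
the arithmetic. The values tickTime = 2000 ms and initLimit = 10 are
assumptions chosen for illustration (the thread does not state the configured
values), and ReadTimeoutSketch is a hypothetical helper, not ZooKeeper code.

```java
import java.time.Duration;

final class ReadTimeoutSketch {
    /** Socket read timeout as described in the thread: tickTime * initLimit. */
    static int readTimeoutMillis(int tickTimeMillis, int initLimitTicks) {
        // multiplyExact throws instead of silently overflowing on bad input.
        return Math.multiplyExact(tickTimeMillis, initLimitTicks);
    }

    public static void main(String[] args) {
        // Assumed example values; check the cluster's own configuration.
        int tickTime = 2000; // milliseconds per tick
        int initLimit = 10;  // number of ticks

        Duration expected = Duration.ofMillis(readTimeoutMillis(tickTime, initLimit));
        Duration observed = Duration.ofMinutes(5); // delay reported in the original post

        System.out.println("expected read timeout: " + expected.getSeconds() + " s");
        System.out.println("observed detection delay: " + observed.getSeconds() + " s");
    }
}
```

With these assumed values the read timeout would be 20 seconds, far short of
the roughly 5 minutes reported, which is why the settings and the time the
leader was killed are worth double checking.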


On Fri, May 30, 2014 at 1:30 AM, FPJ <fp...@yahoo.com> wrote:
> Hi Michi,
>
> 1) The follower stops following the leader when it gets an exception on the
> socket (Follower.followLeader):
>             ...
>             while (self.isRunning()) {
>                 readPacket(qp);
>                 processPacket(qp);
>             }
>         } catch (Exception e) {
>             ...
>
> I believe we are setting the timeout like this: self.tickTime *
> self.initLimit. Check connectToLeader().
>
> 2) I believe we fixed this bug in 3.4.6 and the change is pending for trunk.
> Check ZK-1808 for 3.4.6 and ZK-1810 for trunk.
>
> -Flavio
>
>
>> -----Original Message-----
>> From: mutsuzaki@gmail.com [mailto:mutsuzaki@gmail.com] On Behalf Of
>> Michi Mutsuzaki
>> Sent: 30 May 2014 06:30
>> To: user@zookeeper.apache.org
>> Subject: leader election doesn't settle
>>
>> I have a 3 server cluster using ZooKeeper 3.4.5 with server IDs 61, 150,
>> and 228, and 150 is the leader. I shut down 150. I have 2 questions.
>>
>> 1) Both 61 and 228 take about 5 minutes to detect that the leader died.
>> Is there a TCP setting I need to tune to make this quicker?
>>
>> https://paste.apache.org/4AFR?action=download
>>
>> 2) Leader election between 61 and 228 never settles. 61 doesn't seem to
>> receive notifications from 228, and 228 keeps receiving notifications from
>> 61 for the previous epoch. I restarted 61 and the leader election settled.
>> Have you guys seen this behavior?
>>
>> https://paste.apache.org/37vU?action=download
>

RE: leader election doesn't settle

Posted by FPJ <fp...@yahoo.com>.

Hi Michi,

1) The follower stops following the leader when it gets an exception on the
socket (Follower.followLeader):
            ...
            while (self.isRunning()) {
                readPacket(qp);
                processPacket(qp);
            }
        } catch (Exception e) {
            ...

I believe we are setting the timeout like this: self.tickTime *
self.initLimit. Check connectToLeader().

2) I believe we fixed this bug in 3.4.6 and the change is pending for trunk.
Check ZK-1808 for 3.4.6 and ZK-1810 for trunk.

-Flavio
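
The mechanism in point 1 can be illustrated outside ZooKeeper: a blocking read
on a socket with a read timeout throws once the peer stays silent for that
long, which is the kind of exception that would end a loop like the one above.
The following is a minimal, self-contained sketch using only java.net.
SilentPeerDemo and readTimesOut are hypothetical names, the 200 ms timeout is
arbitrary, and none of this is ZooKeeper's actual code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SilentPeerDemo {
    /** Returns true if reading from a peer that never writes ends in a timeout. */
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket silentPeer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket sock =
                     new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort());
             Socket accepted = silentPeer.accept()) {
            // Without a read timeout, the read below would block indefinitely.
            sock.setSoTimeout(timeoutMillis);
            try {
                sock.getInputStream().read(); // the peer stays connected but silent
                return false;                 // data or end of stream: no timeout
            } catch (SocketTimeoutException e) {
                return true;                  // the exception a read loop would see
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The sketch only covers a peer that is connected but silent. A peer whose
host disappears without closing the connection looks the same to the reader,
so the read timeout is what bounds detection time in that case.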


> -----Original Message-----
> From: mutsuzaki@gmail.com [mailto:mutsuzaki@gmail.com] On Behalf Of
> Michi Mutsuzaki
> Sent: 30 May 2014 06:30
> To: user@zookeeper.apache.org
> Subject: leader election doesn't settle
> 
> I have a 3 server cluster using ZooKeeper 3.4.5 with server IDs 61, 150,
> and 228, and 150 is the leader. I shut down 150. I have 2 questions.
> 
> 1) Both 61 and 228 take about 5 minutes to detect that the leader died.
> Is there a TCP setting I need to tune to make this quicker?
> 
> https://paste.apache.org/4AFR?action=download
> 
> 2) Leader election between 61 and 228 never settles. 61 doesn't seem to
> receive notifications from 228, and 228 keeps receiving notifications from
> 61 for the previous epoch. I restarted 61 and the leader election settled.
> Have you guys seen this behavior?
> 
> https://paste.apache.org/37vU?action=download