Posted to user@zookeeper.apache.org by Guy Laden <gu...@gmail.com> on 2016/08/24 22:15:35 UTC

Working around Leader election Listener thread death

Hi all,

It looks like a security scan sending "bad" traffic to the leader
election port has left us with clusters in which the leader election
Listener thread is dead (an unchecked exception was thrown and the
thread died, as seen in the log). (This appears to be fixed in
https://issues.apache.org/jira/browse/ZOOKEEPER-2186.)

In this state, when a healthy server comes up and tries to connect to
the quorum, it gets stuck in leader election. It establishes TCP
connections to the other servers, but any traffic it sends appears to
get stuck in the receiver's TCP Recv-Q (seen with netstat) and is never
read/processed by ZooKeeper.
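
For reference, the symptom is visible from the shell (a sketch assuming
the default election port 3888; adjust to your config):

    # A nonzero Recv-Q on the election port that never drains means the
    # peer accepted the TCP connection but never reads from it.
    netstat -tan | grep :3888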

Not a good place to be :)

This is with 3.4.6.

Is there a way to get such clusters back to a healthy state without
loss of quorum or client impact? Some way of restarting the listener
thread, or restarting the servers in a certain order? E.g. if I restart
a minority, say the ones with lower server ids, is there a way to get
the majority servers to re-initiate leader election connections with
them, so as to connect them to the quorum (without the majority losing
quorum)?

Thanks,
Guy

Re: Working around Leader election Listener thread death

Posted by Guy Laden <gu...@gmail.com>.
Hi Flavio, I think your idea of using iptables should help. I hope to have
time to experiment with it.
Thanks for your help.
Guy



Re: Working around Leader election Listener thread death

Posted by David Brower <da...@oracle.com>.
OK, makes sense.


Re: Working around Leader election Listener thread death

Posted by Flavio Junqueira <fp...@apache.org>.
Ok, I think I get what you're saying. Perhaps you're missing that this is an issue that Guy encountered in 3.4.6 and that is fixed in a later release. We are discussing a workaround for his 3.4.6 deployment here, not a permanent solution. Does that make sense?

-Flavio


Re: Working around Leader election Listener thread death

Posted by David Brower <da...@oracle.com>.
You'd be programming iptables to pass/accept things from a whitelist
of peers you're willing to talk with.

If you've got such a whitelist, you don't need to program iptables to
look at the peer address from a packet/socket and drop it; you can
just do it in your message-processing code.

The second part deals with various hang situations. If you've got a
critical thread selecting/reading messages, then it can't wait forever
on a stuck read (or write). Every operation needs to be timed out in
some fashion to prevent things like a hung election thread.

You get into this sort of thing when miscreants or pen-testers start
scanning your open ports and sending you malformed or fuzzed packets
that you don't handle cleanly, or start some exchange that they never
complete.

-dB

Oracle RAC Database and Clusterware Architect



Re: Working around Leader election Listener thread death

Posted by Flavio Junqueira <fp...@apache.org>.
I'm not sure what you're suggesting, David. Could you be more specific, please?

-Flavio



Re: Working around Leader election Listener thread death

Posted by David Brower <da...@oracle.com>.
Anything you could do with iptables you can do in the process, by
having it drop connections from things not on a whitelist and by not
having a thread wait indefinitely on operations from any connection.

-dB



Re: Working around Leader election Listener thread death

Posted by Flavio Junqueira <fp...@apache.org>.
I was trying to write down an analysis and I haven't been able to come up with anything that is foolproof. Basically, the two main issues are:

- A bad server is able to connect to a good server when it has an outstanding message and is trying to establish a connection to the good server. This happens if the server is LOOKING or has an outstanding message from the previous round. The converse isn't true, though: a good server can't start a connection to a bad server, because the bad server doesn't have a listener.
- If we bounce servers sequentially, there is a chance that a bad server is elected more than once along the way, which induces multiple leader election rounds.

Perhaps this is overkill, but I was wondering if it makes sense to filter election traffic to and from bad servers using, for example, iptables. The idea is to add a rule, local to each server, that prevents the server from getting connections established for leader election. For each bad server, we stop it, remove the rule, and bring it back up. We also stop a minority first, before stopping the bad leader.
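
Something like this minimal sketch is what I have in mind (3888 is the default election port, and 10.0.0.12 is a placeholder for a bad server's address):

    # On a given server, refuse leader election connections from a bad peer,
    # both inbound to our election port and outbound to the peer's.
    iptables -A INPUT  -p tcp --dport 3888 -s 10.0.0.12 -j DROP
    iptables -A OUTPUT -p tcp --dport 3888 -d 10.0.0.12 -j DROP

    # Then, for each bad server: stop it, delete the rules, bring it back up.
    iptables -D INPUT  -p tcp --dport 3888 -s 10.0.0.12 -j DROP
    iptables -D OUTPUT -p tcp --dport 3888 -d 10.0.0.12 -j DROP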

-Flavio


Re: Working around Leader election Listener thread death

Posted by Guy Laden <gu...@gmail.com>.
Hi Flavio, thanks for your reply. The situation is that indeed all the
servers are in a bad state, so it looks like we will have to perform a
cluster restart.

We played with attempts to optimize the downtime along the lines you
suggested. In testing we ran into the issue where a server with no
Listener thread can initiate a leader election connection to a
newly-restarted server that does have a Listener. The result is a
quorum that may include 'bad' servers, even a 'bad' leader. So we tried
restarting the higher-id servers first, because lower-id servers will
drop their leader-election connections to higher-id servers.
I'm told there are issues with this flow as well, but have not yet
investigated the details.
I also worry about the leader-election retries done with exponential
backoff.

I guess we will play with things a bit more, but at this point I am
tending towards a simple parallel restart of all servers.

Once the clusters are healthy again we will do a rolling upgrade to 3.4.8
sometime soon.

Thanks again,
Guy



Re: Working around Leader election Listener thread death

Posted by Flavio Junqueira <fp...@apache.org>.
Hi Guy,

We don't have a way to restart the listener thread, so you really need to bounce the server. I don't think there is a way of doing this without forcing a leader election, assuming all your servers are in this bad state. To minimize downtime, one thing you can do is avoid bouncing the current leader until it loses quorum support. Once it loses quorum support, you have a quorum of healthy servers and they will elect a new, healthy leader. At that point, you can bounce all your unhealthy servers.
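
As a rough sketch of that sequence (hostnames are placeholders, 2181 is the default client port, and zkServer.sh is the standard control script):

    # Find the current leader: each server reports its role via the
    # 'stat' four-letter command on the client port.
    for h in zk1 zk2 zk3 zk4 zk5; do
      echo -n "$h: "; echo stat | nc "$h" 2181 | grep Mode
    done

    # Bounce the followers first. Once the leader loses quorum support,
    # the healthy majority elects a new leader; then bounce the old one.
    ./bin/zkServer.sh restart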

You may also want to move to a later 3.4 release.

-Flavio