Posted to user@zookeeper.apache.org by Travis Crawford <tr...@gmail.com> on 2010/04/29 09:08:39 UTC

Misbehaving zk servers

Hey zookeeper gurus -

We recently had a ZooKeeper outage after one ZK server was started with
a low file descriptor limit following an upgrade to 3.3.0. Several days
later, that node reached its file descriptor limit and clients started
having major issues.

Are there any circumstances in which a ZK server gets blacklisted from
the ensemble, similar to how Hadoop tasktrackers are blacklisted when
too many tasks fail?

Thanks!
Travis

Re: Misbehaving zk servers

Posted by Travis Crawford <tr...@gmail.com>.
On Thu, Apr 29, 2010 at 10:24 AM, Patrick Hunt <ph...@apache.org> wrote:
> Did you find any bugs on java.sun.com related to those? ;-)
>
> That does sound like a good solution to me. We should stop accepting
> connections and log the event as well. We might also want to update the
> user docs and tell users to monitor the FD count as part of their monitoring
> regime. Is there a way to register for notifications on those via JMX? We
> might want to expose this through our own JMX beans and four-letter-word
> commands to simplify monitoring of this critical resource for users.
>
> Travis, would you mind creating a JIRA for this? Thanks!

Filed:

https://issues.apache.org/jira/browse/ZOOKEEPER-759

Thanks for the feedback all!
Travis



Re: Misbehaving zk servers

Posted by Patrick Hunt <ph...@apache.org>.
Did you find any bugs on java.sun.com related to those? ;-)

That does sound like a good solution to me. We should stop accepting
connections and log the event as well. We might also want to update
the user docs and tell users to monitor the FD count as part of their
monitoring regime. Is there a way to register for notifications on those
via JMX? We might want to expose this through our own JMX beans and
four-letter-word commands to simplify monitoring of this critical
resource for users.
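
On the question of JMX notifications: the JDK's javax.management.monitor.GaugeMonitor can watch a numeric MBean attribute and emit notifications when it crosses a high or low threshold. The following is only a sketch under assumptions, not anything ZooKeeper ships: the class name FdGaugeMonitorSketch, the ObjectName "example.fd:type=FdGaugeMonitor", the one-second polling period, and the threshold values are all made up for illustration. It also assumes a Unix-like JVM where the OperatingSystem MBean exposes OpenFileDescriptorCount.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.ObjectName;
import javax.management.monitor.GaugeMonitor;

/** Hypothetical sketch; not part of ZooKeeper. */
final class FdGaugeMonitorSketch {
    private FdGaugeMonitorSketch() {}

    /** Registers and starts a gauge monitor on the open FD count. */
    static GaugeMonitor start(long highThreshold, long lowThreshold,
                              NotificationListener listener) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();

        GaugeMonitor monitor = new GaugeMonitor();
        monitor.addObservedObject(new ObjectName("java.lang:type=OperatingSystem"));
        monitor.setObservedAttribute("OpenFileDescriptorCount");
        // Both thresholds must be the same Number type as the attribute (a long).
        monitor.setThresholds(Long.valueOf(highThreshold), Long.valueOf(lowThreshold));
        monitor.setNotifyHigh(true);          // notify when rising past the high mark
        monitor.setNotifyLow(true);           // notify when falling past the low mark
        monitor.setGranularityPeriod(1000L);  // polling interval in ms (assumed value)

        // A monitor has to be registered in an MBean server before it can observe.
        // The ObjectName below is a made-up example name.
        mbs.registerMBean(monitor, new ObjectName("example.fd:type=FdGaugeMonitor"));
        monitor.addNotificationListener(listener, null, null);
        monitor.start();
        return monitor;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative absolute thresholds; real values depend on the FD limit.
        GaugeMonitor monitor = start(950L, 900L, new NotificationListener() {
            @Override
            public void handleNotification(Notification n, Object handback) {
                System.out.println(n.getType() + ": " + n.getMessage());
            }
        });
        Thread.sleep(3000L);
        monitor.stop();
    }
}
```

Note that GaugeMonitor thresholds are absolute counts rather than percentages, so a caller would presumably derive them from MaxFileDescriptorCount first.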

Travis, would you mind creating a JIRA for this? Thanks!

Patrick


Re: Misbehaving zk servers

Posted by Travis Crawford <tr...@gmail.com>.
On Thu, Apr 29, 2010 at 9:49 AM, Patrick Hunt <ph...@apache.org> wrote:
> Is there any good (simple/fast/bulletproof) way to monitor FD use inside
> the JVM? If so, we could stop accepting new client connections once we get
> close to the OS-imposed limit. The test would have to be bulletproof,
> though: we wouldn't want to end up in some worse situation (where we refuse
> connections because we mistakenly believe that the limit has been reached).
>
> Might be good to open a JIRA for this and add some tests. In particular we
> should verify the server handles this as gracefully as it can when the limit
> has been reached.

Poking around with jconsole I found two stats that already measure FDs:

- java.lang.OperatingSystem.MaxFileDescriptorCount
- java.lang.OperatingSystem.OpenFileDescriptorCount

They're described (rather tersely) at:

http://java.sun.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html
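
For anyone who wants to read those two values from inside the JVM rather than through jconsole, here is a minimal sketch. It assumes a Sun/Oracle/OpenJDK-style JVM on a Unix-like OS, where the platform OperatingSystemMXBean also implements com.sun.management.UnixOperatingSystemMXBean, and Java 8 or later for OptionalDouble. The class and method names (FdUsage, currentRatio) are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.OptionalDouble;

import com.sun.management.UnixOperatingSystemMXBean;

/** Hypothetical helper; not part of ZooKeeper. */
final class FdUsage {
    private FdUsage() {}

    /**
     * Returns open/max file descriptors as a fraction, or empty when the
     * JVM does not expose the Unix-specific MXBean (for example on Windows).
     */
    static OptionalDouble currentRatio() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof UnixOperatingSystemMXBean)) {
            return OptionalDouble.empty();
        }
        UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
        long open = unix.getOpenFileDescriptorCount();
        long max = unix.getMaxFileDescriptorCount();
        if (max <= 0) {
            // No usable limit reported; treat as unknown instead of guessing.
            return OptionalDouble.empty();
        }
        return OptionalDouble.of((double) open / max);
    }

    public static void main(String[] args) {
        OptionalDouble ratio = currentRatio();
        if (ratio.isPresent()) {
            System.out.printf("FD usage: %.1f%%%n", ratio.getAsDouble() * 100);
        } else {
            System.out.println("FD counts not available on this platform");
        }
    }
}
```

Returning an empty value when the counts are unavailable leaves it to the caller to decide what "unknown" means, which matters given the concern about refusing connections by mistake.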

So it sounds like the feature request would be to stop accepting new
client connections once OpenFileDescriptorCount > 95% of
MaxFileDescriptorCount, and to resume accepting them only once
OpenFileDescriptorCount < 90% of MaxFileDescriptorCount. Basically the
high/low watermark thing.
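
To make the high/low watermark idea concrete, here is a rough sketch of the hysteresis logic on its own. It is only an illustration under assumptions: the class FdWatermarkGate and its update method are hypothetical names, it is not how ZooKeeper implements anything, and the decision to keep accepting when the limit is unknown is one possible reading of the "don't refuse by mistake" concern.

```java
/** Hypothetical hysteresis gate; names and thresholds are illustrative. */
final class FdWatermarkGate {
    private final double low;
    private final double high;
    private boolean accepting = true;

    /** Both watermarks are fractions of the FD limit, e.g. 0.90 and 0.95. */
    FdWatermarkGate(double low, double high) {
        if (!(0.0 <= low && low < high && high <= 1.0)) {
            throw new IllegalArgumentException("need 0 <= low < high <= 1");
        }
        this.low = low;
        this.high = high;
    }

    /** Feeds one sample; returns whether new connections should be accepted. */
    synchronized boolean update(long openFds, long maxFds) {
        if (maxFds <= 0) {
            // Unknown limit: fail open rather than refuse clients by mistake.
            accepting = true;
            return accepting;
        }
        double ratio = (double) openFds / maxFds;
        if (accepting && ratio > high) {
            accepting = false;   // crossed the high watermark: stop accepting
        } else if (!accepting && ratio < low) {
            accepting = true;    // dropped below the low watermark: resume
        }
        // Between the two watermarks the previous state is kept.
        return accepting;
    }
}
```

With new FdWatermarkGate(0.90, 0.95) and a limit of 1000, a sample of 960 open FDs closes the gate, 920 keeps it closed, and 899 reopens it. The gap between the two marks is what keeps the server from flapping when usage hovers near the limit.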

Thoughts?

--travis





Re: Misbehaving zk servers

Posted by Patrick Hunt <ph...@apache.org>.
Is there any good (simple/fast/bulletproof) way to monitor FD use
inside the JVM? If so, we could stop accepting new client connections
once we get close to the OS-imposed limit. The test would have to be
bulletproof, though: we wouldn't want to end up in some worse
situation (where we refuse connections because we mistakenly believe that
the limit has been reached).

Might be good to open a JIRA for this and add some tests. In particular 
we should verify the server handles this as gracefully as it can when 
the limit has been reached.

Patrick


Re: Misbehaving zk servers

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Travis,

 How many clients did you have connected to this server? Usually the default
is 8K file descriptors. Did you have more clients than that?

Also, if clients fail to attach to a server, they will run off to another
server. We do not do any blacklisting because we expect the server to heal,
and if it does not, it shuts itself down in most cases.

Thanks
mahadev

