Posted to user@zookeeper.apache.org by Patrick Hunt <ph...@apache.org> on 2016/05/09 02:37:58 UTC

Interesting elastic/ZK post

Interesting root cause and mitigations discussion.

https://www.elastic.co/blog/elastic-cloud-outage-april-2016

Patrick

Re: Interesting elastic/ZK post

Posted by Chris Nauroth <cn...@hortonworks.com>.

I filed ZOOKEEPER-2424 to track this.

--Chris Nauroth




On 5/9/16, 10:18 AM, "Patrick Hunt" <ph...@apache.org> wrote:

>Makes sense to me to add it. Someone could create a ZK jira? Sounds like a
>great starter project for someone interested to get rolling with ZK.  3.5+
>adds jetty support for accessing metrics, sounds like it would dovetail
>nicely.
>
>Patrick
>
>On Mon, May 9, 2016 at 10:12 AM, Chris Nauroth <cn...@hortonworks.com>
>wrote:
>
>> I always sympathize with a major outage report, but on the bright side, it
>> was very satisfying to hear the ZooKeeper cluster had sustained uptime for
>> 3 years.  That agrees with my own user experience.  It's often the most
>> stable component of a distributed infrastructure (as it needs to be).
>>
>> As far as potential improvements, I was wondering if it would make sense
>> to introduce something like Hadoop's JvmPauseMonitor [1].  This is a
>> background thread that attempts to detect GC churn and log warnings about
>> it.  This has been very helpful in diagnosing NameNode misconfigurations
>> that lead to GC churn.
>>
>> This wouldn't have prevented a problem for the Elastic Cloud team, but at
>> least it would have made the root cause more visible.  A warning about GC
>> churn could have been shown in the main ZooKeeper log instead of a
>> separate GC log or inferring it from other sources like JMX.
>>
>> [1] https://s.apache.org/4sdx
>>
>> --Chris Nauroth
>>
>>
>>
>>
>> On 5/8/16, 7:37 PM, "Patrick Hunt" <ph...@apache.org> wrote:
>>
>> >Interesting root cause and mitigations discussion.
>> >
>> >https://www.elastic.co/blog/elastic-cloud-outage-april-2016
>> >
>> >Patrick
>>
>>

Re: Interesting elastic/ZK post

Posted by Patrick Hunt <ph...@apache.org>.

Makes sense to me to add it. Someone could create a ZK jira? Sounds like a
great starter project for someone interested to get rolling with ZK.  3.5+
adds jetty support for accessing metrics, sounds like it would dovetail
nicely.

Patrick

On Mon, May 9, 2016 at 10:12 AM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> I always sympathize with a major outage report, but on the bright side, it
> was very satisfying to hear the ZooKeeper cluster had sustained uptime for
> 3 years.  That agrees with my own user experience.  It's often the most
> stable component of a distributed infrastructure (as it needs to be).
>
> As far as potential improvements, I was wondering if it would make sense
> to introduce something like Hadoop's JvmPauseMonitor [1].  This is a
> background thread that attempts to detect GC churn and log warnings about
> it.  This has been very helpful in diagnosing NameNode misconfigurations
> that lead to GC churn.
>
> This wouldn't have prevented a problem for the Elastic Cloud team, but at
> least it would have made the root cause more visible.  A warning about GC
> churn could have been shown in the main ZooKeeper log instead of a
> separate GC log or inferring it from other sources like JMX.
>
> [1] https://s.apache.org/4sdx
>
> --Chris Nauroth
>
>
>
>
> On 5/8/16, 7:37 PM, "Patrick Hunt" <ph...@apache.org> wrote:
>
> >Interesting root cause and mitigations discussion.
> >
> >https://www.elastic.co/blog/elastic-cloud-outage-april-2016
> >
> >Patrick
>
>

Re: Interesting elastic/ZK post

Posted by Chris Nauroth <cn...@hortonworks.com>.

I always sympathize with a major outage report, but on the bright side, it
was very satisfying to hear the ZooKeeper cluster had sustained uptime for
3 years.  That agrees with my own user experience.  It's often the most
stable component of a distributed infrastructure (as it needs to be).

As far as potential improvements, I was wondering if it would make sense
to introduce something like Hadoop's JvmPauseMonitor [1].  This is a
background thread that attempts to detect GC churn and log warnings about
it.  This has been very helpful in diagnosing NameNode misconfigurations
that lead to GC churn.
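
For readers unfamiliar with the idea, here is a minimal Java sketch of a
pause-detecting background thread. It is not Hadoop's JvmPauseMonitor code
and not anything that exists in ZooKeeper; the class name PauseMonitorSketch,
the helper extraPauseMillis, and the 500 ms / 1000 ms values are all
assumptions chosen for illustration. The thread sleeps for a fixed interval
and checks how much longer than requested the sleep took. A large overshoot
suggests the JVM (or the host) was stalled, which is often, though not
always, a GC pause.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of a sleep-and-measure pause monitor. */
final class PauseMonitorSketch implements Runnable {
    private final long sleepMillis;          // how long each probe sleeps
    private final long warnThresholdMillis;  // overshoot that triggers a warning
    private final Consumer<String> warn;     // stand-in for a real logger
    private volatile boolean running = true;

    PauseMonitorSketch(long sleepMillis, long warnThresholdMillis,
                       Consumer<String> warn) {
        this.sleepMillis = sleepMillis;
        this.warnThresholdMillis = warnThresholdMillis;
        this.warn = warn;
    }

    /** Time spent beyond the requested sleep, clamped at zero. */
    public static long extraPauseMillis(long elapsedMillis, long sleepMillis) {
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long elapsed =
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            long extra = extraPauseMillis(elapsed, sleepMillis);
            if (extra > warnThresholdMillis) {
                // The overshoot alone cannot prove the cause was GC.
                warn.accept("Detected pause in JVM or host of roughly "
                    + extra + " ms (possible GC)");
            }
        }
    }

    void stop() {
        running = false;
    }

    public static void main(String[] args) throws InterruptedException {
        PauseMonitorSketch monitor =
            new PauseMonitorSketch(500, 1000, System.err::println);
        Thread t = new Thread(monitor, "pause-monitor-sketch");
        t.setDaemon(true);
        t.start();
        Thread.sleep(2000);  // the application would do its real work here
        monitor.stop();
        t.join();
    }
}
```

Routing the warning through the server's normal logger rather than
System.err is what would put it in the main log, which is the visibility
benefit discussed in this thread.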

This wouldn't have prevented a problem for the Elastic Cloud team, but at
least it would have made the root cause more visible.  A warning about GC
churn could have been shown in the main ZooKeeper log instead of a
separate GC log or inferring it from other sources like JMX.

[1] https://s.apache.org/4sdx

--Chris Nauroth

On 5/8/16, 7:37 PM, "Patrick Hunt" <ph...@apache.org> wrote:

>Interesting root cause and mitigations discussion.
>
>https://www.elastic.co/blog/elastic-cloud-outage-april-2016
>
>Patrick
