Posted to user@zookeeper.apache.org by Jerry Hebert <je...@gmail.com> on 2019/10/02 18:05:06 UTC

One node crashing in 3.4.11 triggered a full ensemble restart

Hi all,

My first post here! I'm hoping you all might be able to offer some guidance
or redirect me to an existing ticket. We have a five-node ensemble on
3.4.11 that we're currently in the process of upgrading to 3.5.5. We
recently saw some bizarre behavior in the ensemble that I was hoping to
find a pre-existing ticket or discussion about, but I was having
difficulty finding hits for it in Jira.

The behavior we saw from our metrics is that one of our nodes (not sure
whether it was a follower or the leader) started to show instability (high
CPU, high RAM) and then crashed. Not a big deal on its own, but as soon as
it crashed, all four of the other nodes immediately restarted, resulting
in a short outage. One node crashing should never cause a full ensemble
restart, of course (a five-node ensemble should tolerate two failures), so
I assumed this must be a bug in ZK. The nodes that restarted had no
indication of errors in their logs; they simply restarted. Does this sound
familiar to any of you?

Also, we are using Exhibitor on that ensemble so it's also possible that
the restart was caused by Exhibitor.

My hope is that this issue will be behind us once the 3.5.5 upgrade is
complete, but I'd ideally like to find some concrete evidence of that.

Thanks!
Jerry

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Jerry Hebert <je...@gmail.com>.
This is a really useful discussion; I appreciate it! I'm not too worried
about the restarts that I saw, and they are totally unrelated to the
upgrade. The upgrade is only relevant insofar as I was seeking confidence
that I would not see the issue again once on 3.5.5, but I'm inclined to
believe the restarts were due to Exhibitor.

Whether or not I can create a mixed-version ensemble is a far more
important question to me, since I'm currently trying to devise an upgrade
strategy that avoids taking downtime.
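
For concreteness, the kind of per-node rolling pass I have in mind looks
roughly like this (hostnames and paths are placeholders; it assumes the
stock zkServer.sh scripts and that srvr is whitelisted on the 3.5.5 side):

    # Upgrade one node at a time, keeping the same dataDir and myid.
    for host in zk1 zk2 zk3 zk4 zk5; do
      ssh "$host" '/opt/zookeeper-3.4.11/bin/zkServer.sh stop'
      ssh "$host" 'ln -sfn /opt/zookeeper-3.5.5 /opt/zookeeper'
      ssh "$host" '/opt/zookeeper/bin/zkServer.sh start'
      # Wait until the restarted node rejoins the quorum before moving on.
      until echo srvr | nc "$host" 2181 | grep -Eq 'Mode: (follower|leader)'; do
        sleep 2
      done
    done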

Thanks,
Jerry

On Thu, Oct 3, 2019 at 6:59 AM Enrico Olivelli <eo...@gmail.com> wrote:

> I think it is possible to perform a rolling upgrade from 3.4; all of my
> customers migrated a year ago without any issue (at least none reported
> to my team).
>
> Norbert, where did you find that information?
>
> btw I would like to set up tests for backward compatibility,
> server-to-server and client-to-server
>
> Enrico

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Enrico Olivelli <eo...@gmail.com>.
I think it is possible to perform a rolling upgrade from 3.4; all of my
customers migrated a year ago without any issue (at least none reported to
my team).

Norbert, where did you find that information?

btw I would like to set up tests for backward compatibility,
server-to-server and client-to-server

Enrico

On Thu, Oct 3, 2019 at 3:16 PM Jörn Franke <jo...@gmail.com> wrote:

> I only tried from 3.4.14, and there it was possible. I recommend upgrading
> to the latest 3.4 version first and then to 3.5.

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Jörn Franke <jo...@gmail.com>.
I only tried from 3.4.14, and there it was possible. I recommend upgrading to the latest 3.4 version first and then to 3.5.

> On 02.10.2019 at 21:40, Jerry Hebert <je...@gmail.com> wrote:
> 
> [...]
> 
> Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> wasn't sure if that would work or not. e.g., maybe I could bring up the new
> 3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
> five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Jörn Franke <jo...@gmail.com>.
I can confirm that a rolling update from ZK 3.4 to ZK 3.5 is possible, if and only if a ZK ensemble is used; standalone updates may introduce difficulties.
Of course I cannot speak for all possible setups, but for a ZK ensemble with multiple Solr instances it is possible.

> On 03.10.2019 at 14:55, Shawn Heisey <ap...@elyograg.org> wrote:
> 
> [...]
> 
> This document suggests that a mixed environment of 3.4 and 3.5 will work:
> 
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
> 
> But you seem to be saying that it won't.

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/3/2019 2:45 AM, Norbert Kalmar wrote:
> As for running a mixed-version 3.5/3.4 quorum - I'm afraid it will not
> work. From 3.5 on we have a check on PROTOCOL_VERSION. 3.4 did not have
> this protocol version, so when the nodes try to communicate, the check
> throws an exception. Plus, keeping the quorum protocol backward compatible
> is not a goal, so chances are it would not work even without the check.

This document suggests that a mixed environment of 3.4 and 3.5 will work:

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement

But you seem to be saying that it won't.

As a committer on the Lucene/Solr project (which uses ZK) I am wondering 
what we can tell our users about upgrading ZK.  I was under the 
impression from the wiki page I linked that they could do a rolling 
upgrade with zero downtime, where they do one ZK server at a time.  Are 
you saying that this is not possible?

The Upgrade FAQ that you linked doesn't say anything about 3.4 and 3.5 
not working together.  The only big gotcha I see there is 
ZOOKEEPER-3056, which has a workaround.
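
For reference, my understanding is that the workaround is a Java system
property rather than a zoo.cfg setting; a sketch, where the file placement
is my assumption about a typical setup:

    # conf/java.env (sourced by zkEnv.sh at startup)
    # Lets 3.5.5 load a 3.4-era dataDir that has transaction logs but no
    # snapshot yet (ZOOKEEPER-3056); remove once a snapshot has been taken.
    SERVER_JVMFLAGS="-Dzookeeper.snapshot.trust.empty=true"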

(I think of 4lw whitelisting as just a config problem with a new 
default, not a true upgrade issue)

Thanks,
Shawn

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Norbert Kalmar <nk...@cloudera.com.INVALID>.
Hi,

Here are the issues we encountered so far upgrading to 3.5.5 from 3.4:
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Upgrade+FAQ

As Enrico mentioned, nothing similar to your case so far. One is the "no
snapshot taken yet" problem; the other is that the four-letter-word
commands need to be whitelisted.
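
For the whitelist, something like this in zoo.cfg on the 3.5 nodes (the
exact command list is just an illustration; pick the ones you actually use):

    # 3.5.x disables the four-letter-word commands by default; whitelist
    # the ones you need (or * to allow all, which is not recommended)
    4lw.commands.whitelist=stat, ruok, conf, isro, srvr, mntr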

As for running a mixed-version 3.5/3.4 quorum - I'm afraid it will not
work. From 3.5 on we have a check on PROTOCOL_VERSION. 3.4 did not have
this protocol version, so when the nodes try to communicate, the check
throws an exception. Plus, keeping the quorum protocol backward compatible
is not a goal, so chances are it would not work even without the check.

Regards,
Norbert

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Enrico Olivelli <eo...@gmail.com>.
On Wed, Oct 2, 2019 at 10:52 PM Jerry Hebert <je...@gmail.com> wrote:

> Hi Enrico,
>
> The nodes that restarted did not have any errors in their logs; they seemed
> to simply restart successfully, so I think your hunch about an external
> system is probably correct.
>
> Could you comment on my second question above regarding cross-version
> migration or should I make a new thread?
>


I am not aware of any issue like yours with an upgrade from 3.4 to 3.5. It
is expected to work.

Enrico

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Jerry Hebert <je...@gmail.com>.
Hi Enrico,

The nodes that restarted did not have any errors in their logs; they seemed
to simply restart successfully, so I think your hunch about an external
system is probably correct.

Could you comment on my second question above regarding cross-version
migration or should I make a new thread?

Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> wasn't sure if that would work or not. e.g., maybe I could bring up the new
> 3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
> five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?


Thanks!
Jerry

On Wed, Oct 2, 2019 at 1:12 PM Enrico Olivelli <eo...@gmail.com> wrote:

> Any particular error/stacktrace in the logs?
> If it is ZooKeeper that is killing itself, it should log it; otherwise it
> is some other external system. I am sorry, I don't know Exhibitor.
>
> Hope that helps
> Enrico

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Enrico Olivelli <eo...@gmail.com>.
Any particular error/stacktrace in the logs?
If it is ZooKeeper that is killing itself, it should log it; otherwise it
is some other external system. I am sorry, I don't know Exhibitor.

Hope that helps
Enrico

On Wed, Oct 2, 2019 at 9:40 PM Jerry Hebert <je...@gmail.com> wrote:

> [...]
> 
> The nodes that restarted had no indication of errors in their logs; they
> simply restarted. Does this sound familiar to any of you?

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Jerry Hebert <je...@gmail.com>.
Hi Jörn,

No, this was a very intermittent issue. We've been running this ensemble
for about four years now and had never seen this problem before, so it
seems to be super heisenbuggy. Our upgrade process will be more involved
than what you described (we're switching networks, instance types, and the
underlying automation, and removing Exhibitor), but I'm glad you asked,
because I have a question about that too. :)

Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
wasn't sure if that would work or not. e.g., maybe I could bring up the new
3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?

Thanks,
Jerry

On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke <jo...@gmail.com> wrote:

> Have you tried to stop the node, delete the data and log directories,
> upgrade to 3.5.5, start the node, and wait until it is synchronized?

Re: One node crashing in 3.4.11 triggered a full ensemble restart

Posted by Jörn Franke <jo...@gmail.com>.
Have you tried to stop the node, delete the data and log directories, upgrade to 3.5.5, start the node, and wait until it is synchronized?
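
Roughly like this per node; the paths are placeholders, the dataDir /
dataLogDir values must match your zoo.cfg, and note that deleting only the
version-2 subdirectories keeps the myid file in place:

    bin/zkServer.sh stop
    # wipe the snapshot and transaction log data, but keep myid
    rm -rf /var/lib/zookeeper/version-2 /var/lib/zookeeper-logs/version-2
    # point the installation at the 3.5.5 binaries, same zoo.cfg and myid
    bin/zkServer.sh start
    echo srvr | nc localhost 2181   # wait for "Mode: follower" (or leader)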
