You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@zookeeper.apache.org by Camille Fournier <ca...@apache.org> on 2012/04/13 17:09:34 UTC

Input on a change

Hi everyone,

I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
like some feedback from the user base on it.
https://issues.apache.org/jira/browse/ZOOKEEPER-1442

The current behavior of ZK when we get an uncaught exception is to log it
and try to move on. This is arguably not the right thing to do, and will
possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
longer than it should.
The patch proposes that when we get an instance of java.lang.Error, we
should do a system.exit to fast-fail the process. With the possible
exception of ThreadDeath (which may or may not be an unrecoverable system
state depending on the thread), I think this makes sense, but I would like
to hear from others if they have an opinion. I think it's better to kill
the process and let your monitoring services detect process death (and thus
restart) than possibly linger unresponsive for a while, are there scenarios
that we're missing where this error can occur and you wouldn't want the
process killed?

Thanks for your feedback,

Camille

Re: Input on a change

Posted by Ted Dunning <te...@gmail.com>.

It would be lovely to have a ZK test jar that facilitates mock testing like
this.  I have done as Camille has, but the result is that we have probably
replicated work without resulting in a usable framework.

Shame on me at least.

The mockup structure for tests *inside* ZK is far better than the structure
available outside of ZK.  Unfortunately, every time I have taken a run at
this, it has exceeded my fun-time budget without significant publishable
results.

2012/4/16 Camille Fournier <ca...@apache.org>

> I've done extensive ZK client testing mostly using mocks with great
> success. I agree that there's cases when embedding a ZK is probably the
> easiest way to go, but it is actually much easier to simulate most failure
> cases in ZK via mocks than running an embedded server. It does require more
> up-front developer thought to set up the tests, but that thought pays
> dividends in that you actually are then forced to understand the possible
> failures and simulate them.
> It seems ultimately like a flag is the right way to go here, so I'll leave
> that feedback on the ticket.
>
> C
>
> 2012/4/16 Scott Fines <sc...@gmail.com>
>
> > Not to start a flame war here..
> >
> > Mocks are good for a lot of purposes, but I can think of a few cases
> where
> > testing failure states is a lot easier if you are actively contacting a
> ZK
> > server (like testing a failed client connection's consequences). To that
> > end, it seems like a good idea to support this kind of testing. So far,
> the
> > easiest way seems to be embedding a ZK Server.
> >
> > On the other hand, it doesn't seem likely that an active, production ZK
> > cluster is going to want any kind of unusual behavior. It's almost always
> > better to have some kind of behavior that is very well defined and
> > predictable happen, even if that behavior is catastrophic. In this case,
> > having a production server shut itself off whenever it doesn't know how
> to
> > recover is very predictable, and can be designed around. Leaving that
> > cluster in an awkward state that may or may not behave predictably is
> much
> > harder to work with.
> >
> > I personally can't see much else that you'd want to do in this scenario
> > that falls far enough outside these two patterns that it's worth
> supporting
> > custom handlers. It seems like the support time would be better spent
> > developing and planning around these two states.
> >
> > Scott
> >
> >
> >
> > On Mon, Apr 16, 2012 at 1:09 PM, John Sirois <jo...@gmail.com>
> > wrote:
> >
> > > On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <is...@gmail.com>
> > wrote:
> > >
> > > > Fair enough, if you think it is an anti-pattern then perhaps you can
> > > > suggest an alternative that allows one to write tests that are (in
> > > > descending order of priority):
> > > >
> > > > 1. Quick and easy to setup and teardown
> > > > 2. Repeatably testable and debuggable in an IDE without having to
> > resort
> > > to
> > > > the external build tool
> > > > 3. Testable in parallel, i.e. having more than one build running on a
> > CI
> > > > server, so need some way to avoid port clashes
> > > > 4. Cross-platform - i.e. run the exact same build sequence on
> multiple
> > > > OSes.
> > > >
> > >
> > > We also embed for testing for all the reasons above with good success.
> >  In
> > > fact - I like the idea of the server product directly supporting
> > alternate
> > > mains - although this may be more burden on zk devs initially.  This
> > would
> > > allow an org to write their own main that plugs in an uncaught handler
> > and
> > > does whatever is appropriate in their deploy environment.
> > >
> > >
> > > >
> > > > Ishaaq
> > > >
> > > > On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:
> > > >
> > > > > I believe that this change is inspired by someone that runs zk
> > > embedded.
> > > > > Personally I'm not moved by the testing argument, embedding the
> > server
> > > > for
> > > > > testing is a bit of an anti pattern in my mind.
> > > > >
> > > > > From my phone
> > > > > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com>
> wrote:
> > > > >
> > > > > > I'd go so far as to say that even the server-code should avoid
> > > > > System.exit.
> > > > > > Just because it is "meant" to be a standalone system doesn't mean
> > > that
> > > > > code
> > > > > > that makes it impossible to embed it should be encouraged.
> > > > > >
> > > > > > For e.g, we embed a local version of ZK to be used inside our
> unit
> > > > tests.
> > > > > > This makes it much easier for us to control ZK to coincide with
> > test
> > > > > > expectations as well as making for much faster build times. It
> > would
> > > > be a
> > > > > > shame if the embedded ZK started killing the JVM.
> > > > > >
> > > > > > Ishaaq
> > > > > >
> > > > > > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org>
> > wrote:
> > > > > >
> > > > > > > This is a good point.
> > > > > > > I think this change should be fine for the server portion of
> the
> > > > code,
> > > > > > > since it's designed to be run as a standalone system. But for
> the
> > > > > > > client connection to also call system.exit on such an error is
> > > > > > > overreaching for all the reasons listed below.
> > > > > > >
> > > > > > > C
> > > > > > >
> > > > > > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > > > > > I really would not like for any library to perform a
> > System.exit
> > > > > call.
> > > > > > > This
> > > > > > > > would make huge program exit out of sudden (think about j2ee,
> > you
> > > > may
> > > > > > be
> > > > > > > > bitten by security manager).  Note that there are more or
> less
> > > safe
> > > > > > > errors,
> > > > > > > > like StackOverflowError.
> > > > > > > > Also System.exit make testing nightmare. E.g. maven2 silently
> > > skips
> > > > > any
> > > > > > > > tests after the one that calls System.exit. And everything's
> > > green.
> > > > > > > > As for me good options are:
> > > > > > > > 1) Call user-provided uncaught exception handler. Use the one
> > > from
> > > > > the
> > > > > > > > thread that created the connection if one is not specified
> > > > explicity.
> > > > > > > > 1) Stop everything, notifying user with a global watcher. If
> > it's
> > > > > > > possible,
> > > > > > > > clean any static state (e.g. restart threads) and allow to
> > > restart
> > > > > > > > connection.
> > > > > > > > In any case, call user code. Good system already know how to
> > > react
> > > > > (it
> > > > > > > may
> > > > > > > > want to send email to admin), allow it to perform well.
> > > > > > > >
> > > > > > > > Best regards, Vitalii Tymchyshyn.
> > > > > > > >
> > > > > > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > > > > > >
> > > > > > > >> Hi everyone,
> > > > > > > >>
> > > > > > > >> I'm trying to evaluate a patch that Jeremy Stribling has
> > > > submitted,
> > > > > > and
> > > > > > > I'd
> > > > > > > >> like some feedback from the user base on it.
> > > > > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > > > > > >>
> > > > > > > >> The current behavior of ZK when we get an uncaught exception
> > is
> > > to
> > > > > log
> > > > > > > it
> > > > > > > >> and try to move on. This is arguably not the right thing to
> > do,
> > > > and
> > > > > > will
> > > > > > > >> possibly cause ZK to limp along with a bad VM (say, in an
> OOM
> > > > state)
> > > > > > for
> > > > > > > >> longer than it should.
> > > > > > > >> The patch proposes that when we get an instance of
> > > > java.lang.Error,
> > > > > we
> > > > > > > >> should do a system.exit to fast-fail the process. With the
> > > > possible
> > > > > > > >> exception of ThreadDeath (which may or may not be an
> > > unrecoverable
> > > > > > > system
> > > > > > > >> state depending on the thread), I think this makes sense,
> but
> > I
> > > > > would
> > > > > > > like
> > > > > > > >> to hear from others if they have an opinion. I think it's
> > better
> > > > to
> > > > > > kill
> > > > > > > >> the process and let your monitoring services detect process
> > > death
> > > > > (and
> > > > > > > thus
> > > > > > > >> restart) than possibly linger unresponsive for a while, are
> > > there
> > > > > > > scenarios
> > > > > > > >> that we're missing where this error can occur and you
> wouldn't
> > > > want
> > > > > > the
> > > > > > > >> process killed?
> > > > > > > >>
> > > > > > > >> Thanks for your feedback,
> > > > > > > >>
> > > > > > > >> Camille
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > >  Vitalii Tymchyshyn
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Input on a change

Posted by Camille Fournier <ca...@apache.org>.

I've done extensive ZK client testing mostly using mocks with great
success. I agree that there's cases when embedding a ZK is probably the
easiest way to go, but it is actually much easier to simulate most failure
cases in ZK via mocks than running an embedded server. It does require more
up-front developer thought to set up the tests, but that thought pays
dividends in that you actually are then forced to understand the possible
failures and simulate them.
It seems ultimately like a flag is the right way to go here, so I'll leave
that feedback on the ticket.

C

2012/4/16 Scott Fines <sc...@gmail.com>

> Not to start a flame war here..
>
> Mocks are good for a lot of purposes, but I can think of a few cases where
> testing failure states is a lot easier if you are actively contacting a ZK
> server (like testing a failed client connection's consequences). To that
> end, it seems like a good idea to support this kind of testing. So far, the
> easiest way seems to be embedding a ZK Server.
>
> On the other hand, it doesn't seem likely that an active, production ZK
> cluster is going to want any kind of unusual behavior. It's almost always
> better to have some kind of behavior that is very well defined and
> predictable happen, even if that behavior is catastrophic. In this case,
> having a production server shut itself off whenever it doesn't know how to
> recover is very predictable, and can be designed around. Leaving that
> cluster in an awkward state that may or may not behave predictably is much
> harder to work with.
>
> I personally can't see much else that you'd want to do in this scenario
> that falls far enough outside these two patterns that it's worth supporting
> custom handlers. It seems like the support time would be better spent
> developing and planning around these two states.
>
> Scott
>
>
>
> On Mon, Apr 16, 2012 at 1:09 PM, John Sirois <jo...@gmail.com>
> wrote:
>
> > On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <is...@gmail.com>
> wrote:
> >
> > > Fair enough, if you think it is an anti-pattern then perhaps you can
> > > suggest an alternative that allows one to write tests that are (in
> > > descending order of priority):
> > >
> > > 1. Quick and easy to setup and teardown
> > > 2. Repeatably testable and debuggable in an IDE without having to
> resort
> > to
> > > the external build tool
> > > 3. Testable in parallel, i.e. having more than one build running on a
> CI
> > > server, so need some way to avoid port clashes
> > > 4. Cross-platform - i.e. run the exact same build sequence on multiple
> > > OSes.
> > >
> >
> > We also embed for testing for all the reasons above with good success.
>  In
> > fact - I like the idea of the server product directly supporting
> alternate
> > mains - although this may be more burden on zk devs initially.  This
> would
> > allow an org to write their own main that plugs in an uncaught handler
> and
> > does whatever is appropriate in their deploy environment.
> >
> >
> > >
> > > Ishaaq
> > >
> > > On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:
> > >
> > > > I believe that this change is inspired by someone that runs zk
> > embedded.
> > > > Personally I'm not moved by the testing argument, embedding the
> server
> > > for
> > > > testing is a bit of an anti pattern in my mind.
> > > >
> > > > From my phone
> > > > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
> > > >
> > > > > I'd go so far as to say that even the server-code should avoid
> > > > System.exit.
> > > > > Just because it is "meant" to be a standalone system doesn't mean
> > that
> > > > code
> > > > > that makes it impossible to embed it should be encouraged.
> > > > >
> > > > > For e.g, we embed a local version of ZK to be used inside our unit
> > > tests.
> > > > > This makes it much easier for us to control ZK to coincide with
> test
> > > > > expectations as well as making for much faster build times. It
> would
> > > be a
> > > > > shame if the embedded ZK started killing the JVM.
> > > > >
> > > > > Ishaaq
> > > > >
> > > > > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org>
> wrote:
> > > > >
> > > > > > This is a good point.
> > > > > > I think this change should be fine for the server portion of the
> > > code,
> > > > > > since it's designed to be run as a standalone system. But for the
> > > > > > client connection to also call system.exit on such an error is
> > > > > > overreaching for all the reasons listed below.
> > > > > >
> > > > > > C
> > > > > >
> > > > > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > > > > I really would not like for any library to perform a
> System.exit
> > > > call.
> > > > > > This
> > > > > > > would make huge program exit out of sudden (think about j2ee,
> you
> > > may
> > > > > be
> > > > > > > bitten by security manager).  Note that there are more or less
> > safe
> > > > > > errors,
> > > > > > > like StackOverflowError.
> > > > > > > Also System.exit make testing nightmare. E.g. maven2 silently
> > skips
> > > > any
> > > > > > > tests after the one that calls System.exit. And everything's
> > green.
> > > > > > > As for me good options are:
> > > > > > > 1) Call user-provided uncaught exception handler. Use the one
> > from
> > > > the
> > > > > > > thread that created the connection if one is not specified
> > > explicity.
> > > > > > > 1) Stop everything, notifying user with a global watcher. If
> it's
> > > > > > possible,
> > > > > > > clean any static state (e.g. restart threads) and allow to
> > restart
> > > > > > > connection.
> > > > > > > In any case, call user code. Good system already know how to
> > react
> > > > (it
> > > > > > may
> > > > > > > want to send email to admin), allow it to perform well.
> > > > > > >
> > > > > > > Best regards, Vitalii Tymchyshyn.
> > > > > > >
> > > > > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > > > > >
> > > > > > >> Hi everyone,
> > > > > > >>
> > > > > > >> I'm trying to evaluate a patch that Jeremy Stribling has
> > > submitted,
> > > > > and
> > > > > > I'd
> > > > > > >> like some feedback from the user base on it.
> > > > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > > > > >>
> > > > > > >> The current behavior of ZK when we get an uncaught exception
> is
> > to
> > > > log
> > > > > > it
> > > > > > >> and try to move on. This is arguably not the right thing to
> do,
> > > and
> > > > > will
> > > > > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM
> > > state)
> > > > > for
> > > > > > >> longer than it should.
> > > > > > >> The patch proposes that when we get an instance of
> > > java.lang.Error,
> > > > we
> > > > > > >> should do a system.exit to fast-fail the process. With the
> > > possible
> > > > > > >> exception of ThreadDeath (which may or may not be an
> > unrecoverable
> > > > > > system
> > > > > > >> state depending on the thread), I think this makes sense, but
> I
> > > > would
> > > > > > like
> > > > > > >> to hear from others if they have an opinion. I think it's
> better
> > > to
> > > > > kill
> > > > > > >> the process and let your monitoring services detect process
> > death
> > > > (and
> > > > > > thus
> > > > > > >> restart) than possibly linger unresponsive for a while, are
> > there
> > > > > > scenarios
> > > > > > >> that we're missing where this error can occur and you wouldn't
> > > want
> > > > > the
> > > > > > >> process killed?
> > > > > > >>
> > > > > > >> Thanks for your feedback,
> > > > > > >>
> > > > > > >> Camille
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >  Vitalii Tymchyshyn
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Input on a change

Posted by Jeremy Stribling <st...@nicira.com>.

I would be very happy with a solution that allows a custom uncaught 
exception handler -- that's how we are working around this issue now, by 
installing a new default uncaught exception handler after we're sure 
NIOServerCnxnFactory has installed its own.  I do think by default ZK 
should shut down predictably if it hits an error (with the possible 
exception of ThreadDeath), but turning it off by default until a major 
release seems reasonable if we allow a custom handler.

I don't know that I'm the right person to implement it though, not being 
super strong in Java or in the design of user inputs into ZK, so if 
anyone wants to take a crack at this, feel free.

On 04/16/2012 11:27 AM, Scott Fines wrote:
> Not to start a flame war here..
>
> Mocks are good for a lot of purposes, but I can think of a few cases where
> testing failure states is a lot easier if you are actively contacting a ZK
> server (like testing a failed client connection's consequences). To that
> end, it seems like a good idea to support this kind of testing. So far, the
> easiest way seems to be embedding a ZK Server.
>
> On the other hand, it doesn't seem likely that an active, production ZK
> cluster is going to want any kind of unusual behavior. It's almost always
> better to have some kind of behavior that is very well defined and
> predictable happen, even if that behavior is catastrophic. In this case,
> having a production server shut itself off whenever it doesn't know how to
> recover is very predictable, and can be designed around. Leaving that
> cluster in an awkward state that may or may not behave predictably is much
> harder to work with.
>
> I personally can't see much else that you'd want to do in this scenario
> that falls far enough outside these two patterns that it's worth supporting
> custom handlers. It seems like the support time would be better spent
> developing and planning around these two states.
>
> Scott
>
>
>
> On Mon, Apr 16, 2012 at 1:09 PM, John Sirois<jo...@gmail.com>  wrote:
>
>> On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy<is...@gmail.com>  wrote:
>>
>>> Fair enough, if you think it is an anti-pattern then perhaps you can
>>> suggest an alternative that allows one to write tests that are (in
>>> descending order of priority):
>>>
>>> 1. Quick and easy to setup and teardown
>>> 2. Repeatably testable and debuggable in an IDE without having to resort
>> to
>>> the external build tool
>>> 3. Testable in parallel, i.e. having more than one build running on a CI
>>> server, so need some way to avoid port clashes
>>> 4. Cross-platform - i.e. run the exact same build sequence on multiple
>>> OSes.
>>>
>> We also embed for testing for all the reasons above with good success.  In
>> fact - I like the idea of the server product directly supporting alternate
>> mains - although this may be more burden on zk devs initially.  This would
>> allow an org to write their own main that plugs in an uncaught handler and
>> does whatever is appropriate in their deploy environment.
>>
>>
>>> Ishaaq
>>>
>>> On 16 April 2012 22:55, Camille Fournier<ca...@apache.org>  wrote:
>>>
>>>> I believe that this change is inspired by someone that runs zk
>> embedded.
>>>> Personally I'm not moved by the testing argument, embedding the server
>>> for
>>>> testing is a bit of an anti pattern in my mind.
>>>>
>>>>  From my phone
>>>> On Apr 15, 2012 11:20 PM, "Ishaaq Chandy"<is...@gmail.com>  wrote:
>>>>
>>>>> I'd go so far as to say that even the server-code should avoid
>>>> System.exit.
>>>>> Just because it is "meant" to be a standalone system doesn't mean
>> that
>>>> code
>>>>> that makes it impossible to embed it should be encouraged.
>>>>>
>>>>> For e.g, we embed a local version of ZK to be used inside our unit
>>> tests.
>>>>> This makes it much easier for us to control ZK to coincide with test
>>>>> expectations as well as making for much faster build times. It would
>>> be a
>>>>> shame if the embedded ZK started killing the JVM.
>>>>>
>>>>> Ishaaq
>>>>>
>>>>> On 16 April 2012 04:28, Camille Fournier<ca...@apache.org>  wrote:
>>>>>
>>>>>> This is a good point.
>>>>>> I think this change should be fine for the server portion of the
>>> code,
>>>>>> since it's designed to be run as a standalone system. But for the
>>>>>> client connection to also call system.exit on such an error is
>>>>>> overreaching for all the reasons listed below.
>>>>>>
>>>>>> C
>>>>>>
>>>>>> 2012/4/15 Віталій Тимчишин<ti...@gmail.com>:
>>>>>>> I really would not like for any library to perform a System.exit
>>>> call.
>>>>>> This
>>>>>>> would make huge program exit out of sudden (think about j2ee, you
>>> may
>>>>> be
>>>>>>> bitten by security manager).  Note that there are more or less
>> safe
>>>>>> errors,
>>>>>>> like StackOverflowError.
>>>>>>> Also System.exit make testing nightmare. E.g. maven2 silently
>> skips
>>>> any
>>>>>>> tests after the one that calls System.exit. And everything's
>> green.
>>>>>>> As for me good options are:
>>>>>>> 1) Call user-provided uncaught exception handler. Use the one
>> from
>>>> the
>>>>>>> thread that created the connection if one is not specified
>>> explicity.
>>>>>>> 1) Stop everything, notifying user with a global watcher. If it's
>>>>>> possible,
>>>>>>> clean any static state (e.g. restart threads) and allow to
>> restart
>>>>>>> connection.
>>>>>>> In any case, call user code. Good system already know how to
>> react
>>>> (it
>>>>>> may
>>>>>>> want to send email to admin), allow it to perform well.
>>>>>>>
>>>>>>> Best regards, Vitalii Tymchyshyn.
>>>>>>>
>>>>>>> 2012/4/13 Camille Fournier<ca...@apache.org>
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I'm trying to evaluate a patch that Jeremy Stribling has
>>> submitted,
>>>>> and
>>>>>> I'd
>>>>>>>> like some feedback from the user base on it.
>>>>>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>>>>>>
>>>>>>>> The current behavior of ZK when we get an uncaught exception is
>> to
>>>> log
>>>>>> it
>>>>>>>> and try to move on. This is arguably not the right thing to do,
>>> and
>>>>> will
>>>>>>>> possibly cause ZK to limp along with a bad VM (say, in an OOM
>>> state)
>>>>> for
>>>>>>>> longer than it should.
>>>>>>>> The patch proposes that when we get an instance of
>>> java.lang.Error,
>>>> we
>>>>>>>> should do a system.exit to fast-fail the process. With the
>>> possible
>>>>>>>> exception of ThreadDeath (which may or may not be an
>> unrecoverable
>>>>>> system
>>>>>>>> state depending on the thread), I think this makes sense, but I
>>>> would
>>>>>> like
>>>>>>>> to hear from others if they have an opinion. I think it's better
>>> to
>>>>> kill
>>>>>>>> the process and let your monitoring services detect process
>> death
>>>> (and
>>>>>> thus
>>>>>>>> restart) than possibly linger unresponsive for a while, are
>> there
>>>>>> scenarios
>>>>>>>> that we're missing where this error can occur and you wouldn't
>>> want
>>>>> the
>>>>>>>> process killed?
>>>>>>>>
>>>>>>>> Thanks for your feedback,
>>>>>>>>
>>>>>>>> Camille
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>>   Vitalii Tymchyshyn

Re: Input on a change

Posted by Scott Fines <sc...@gmail.com>.

Not to start a flame war here..

Mocks are good for a lot of purposes, but I can think of a few cases where
testing failure states is a lot easier if you are actively contacting a ZK
server (like testing a failed client connection's consequences). To that
end, it seems like a good idea to support this kind of testing. So far, the
easiest way seems to be embedding a ZK Server.

On the other hand, it doesn't seem likely that an active, production ZK
cluster is going to want any kind of unusual behavior. It's almost always
better to have some kind of behavior that is very well defined and
predictable happen, even if that behavior is catastrophic. In this case,
having a production server shut itself off whenever it doesn't know how to
recover is very predictable, and can be designed around. Leaving that
cluster in an awkward state that may or may not behave predictably is much
harder to work with.

I personally can't see much else that you'd want to do in this scenario
that falls far enough outside these two patterns that it's worth supporting
custom handlers. It seems like the support time would be better spent
developing and planning around these two states.

Scott



On Mon, Apr 16, 2012 at 1:09 PM, John Sirois <jo...@gmail.com> wrote:

> On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <is...@gmail.com> wrote:
>
> > Fair enough, if you think it is an anti-pattern then perhaps you can
> > suggest an alternative that allows one to write tests that are (in
> > descending order of priority):
> >
> > 1. Quick and easy to setup and teardown
> > 2. Repeatably testable and debuggable in an IDE without having to resort
> to
> > the external build tool
> > 3. Testable in parallel, i.e. having more than one build running on a CI
> > server, so need some way to avoid port clashes
> > 4. Cross-platform - i.e. run the exact same build sequence on multiple
> > OSes.
> >
>
> We also embed for testing for all the reasons above with good success.  In
> fact - I like the idea of the server product directly supporting alternate
> mains - although this may be more burden on zk devs initially.  This would
> allow an org to write their own main that plugs in an uncaught handler and
> does whatever is appropriate in their deploy environment.
>
>
> >
> > Ishaaq
> >
> > On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:
> >
> > > I believe that this change is inspired by someone that runs zk
> embedded.
> > > Personally I'm not moved by the testing argument, embedding the server
> > for
> > > testing is a bit of an anti pattern in my mind.
> > >
> > > From my phone
> > > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
> > >
> > > > I'd go so far as to say that even the server-code should avoid
> > > System.exit.
> > > > Just because it is "meant" to be a standalone system doesn't mean
> that
> > > code
> > > > that makes it impossible to embed it should be encouraged.
> > > >
> > > > For e.g, we embed a local version of ZK to be used inside our unit
> > tests.
> > > > This makes it much easier for us to control ZK to coincide with test
> > > > expectations as well as making for much faster build times. It would
> > be a
> > > > shame if the embedded ZK started killing the JVM.
> > > >
> > > > Ishaaq
> > > >
> > > > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
> > > >
> > > > > This is a good point.
> > > > > I think this change should be fine for the server portion of the
> > code,
> > > > > since it's designed to be run as a standalone system. But for the
> > > > > client connection to also call system.exit on such an error is
> > > > > overreaching for all the reasons listed below.
> > > > >
> > > > > C
> > > > >
> > > > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > > > I really would not like for any library to perform a System.exit
> > > call.
> > > > > This
> > > > > > would make huge program exit out of sudden (think about j2ee, you
> > may
> > > > be
> > > > > > bitten by security manager).  Note that there are more or less
> safe
> > > > > errors,
> > > > > > like StackOverflowError.
> > > > > > Also System.exit make testing nightmare. E.g. maven2 silently
> skips
> > > any
> > > > > > tests after the one that calls System.exit. And everything's
> green.
> > > > > > As for me good options are:
> > > > > > 1) Call user-provided uncaught exception handler. Use the one
> from
> > > the
> > > > > > thread that created the connection if one is not specified
> > explicity.
> > > > > > 1) Stop everything, notifying user with a global watcher. If it's
> > > > > possible,
> > > > > > clean any static state (e.g. restart threads) and allow to
> restart
> > > > > > connection.
> > > > > > In any case, call user code. Good system already know how to
> react
> > > (it
> > > > > may
> > > > > > want to send email to admin), allow it to perform well.
> > > > > >
> > > > > > Best regards, Vitalii Tymchyshyn.
> > > > > >
> > > > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > > > >
> > > > > >> Hi everyone,
> > > > > >>
> > > > > >> I'm trying to evaluate a patch that Jeremy Stribling has
> > submitted,
> > > > and
> > > > > I'd
> > > > > >> like some feedback from the user base on it.
> > > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > > > >>
> > > > > >> The current behavior of ZK when we get an uncaught exception is
> to
> > > log
> > > > > it
> > > > > >> and try to move on. This is arguably not the right thing to do,
> > and
> > > > will
> > > > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM
> > state)
> > > > for
> > > > > >> longer than it should.
> > > > > >> The patch proposes that when we get an instance of
> > java.lang.Error,
> > > we
> > > > > >> should do a system.exit to fast-fail the process. With the
> > possible
> > > > > >> exception of ThreadDeath (which may or may not be an
> unrecoverable
> > > > > system
> > > > > >> state depending on the thread), I think this makes sense, but I
> > > would
> > > > > like
> > > > > >> to hear from others if they have an opinion. I think it's better
> > to
> > > > kill
> > > > > >> the process and let your monitoring services detect process
> death
> > > (and
> > > > > thus
> > > > > >> restart) than possibly linger unresponsive for a while, are
> there
> > > > > scenarios
> > > > > >> that we're missing where this error can occur and you wouldn't
> > want
> > > > the
> > > > > >> process killed?
> > > > > >>
> > > > > >> Thanks for your feedback,
> > > > > >>
> > > > > >> Camille
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > >  Vitalii Tymchyshyn
> > > > >
> > > >
> > >
> >
>

Re: Input on a change

Posted by John Sirois <jo...@gmail.com>.

On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <is...@gmail.com> wrote:

> Fair enough, if you think it is an anti-pattern then perhaps you can
> suggest an alternative that allows one to write tests that are (in
> descending order of priority):
>
> 1. Quick and easy to setup and teardown
> 2. Repeatably testable and debuggable in an IDE without having to resort to
> the external build tool
> 3. Testable in parallel, i.e. having more than one build running on a CI
> server, so need some way to avoid port clashes
> 4. Cross-platform - i.e. run the exact same build sequence on multiple
> OSes.
>

We also embed for testing for all the reasons above with good success.  In
fact - I like the idea of the server product directly supporting alternate
mains - although this may be more burden on zk devs initially.  This would
allow an org to write their own main that plugs in an uncaught handler and
does whatever is appropriate in their deploy environment.


>
> Ishaaq
>
> On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:
>
> > I believe that this change is inspired by someone that runs zk embedded.
> > Personally I'm not moved by the testing argument, embedding the server
> for
> > testing is a bit of an anti pattern in my mind.
> >
> > From my phone
> > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
> >
> > > I'd go so far as to say that even the server-code should avoid
> > System.exit.
> > > Just because it is "meant" to be a standalone system doesn't mean that
> > code
> > > that makes it impossible to embed it should be encouraged.
> > >
> > > For e.g, we embed a local version of ZK to be used inside our unit
> tests.
> > > This makes it much easier for us to control ZK to coincide with test
> > > expectations as well as making for much faster build times. It would
> be a
> > > shame if the embedded ZK started killing the JVM.
> > >
> > > Ishaaq
> > >
> > > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
> > >
> > > > This is a good point.
> > > > I think this change should be fine for the server portion of the
> code,
> > > > since it's designed to be run as a standalone system. But for the
> > > > client connection to also call system.exit on such an error is
> > > > overreaching for all the reasons listed below.
> > > >
> > > > C
> > > >
> > > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > > I really would not like for any library to perform a System.exit
> > call.
> > > > This
> > > > > would make huge program exit out of sudden (think about j2ee, you
> may
> > > be
> > > > > bitten by security manager).  Note that there are more or less safe
> > > > errors,
> > > > > like StackOverflowError.
> > > > > Also System.exit make testing nightmare. E.g. maven2 silently skips
> > any
> > > > > tests after the one that calls System.exit. And everything's green.
> > > > > As for me good options are:
> > > > > 1) Call user-provided uncaught exception handler. Use the one from
> > the
> > > > > thread that created the connection if one is not specified
> explicity.
> > > > > 1) Stop everything, notifying user with a global watcher. If it's
> > > > possible,
> > > > > clean any static state (e.g. restart threads) and allow to restart
> > > > > connection.
> > > > > In any case, call user code. Good system already know how to react
> > (it
> > > > may
> > > > > want to send email to admin), allow it to perform well.
> > > > >
> > > > > Best regards, Vitalii Tymchyshyn.
> > > > >
> > > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I'm trying to evaluate a patch that Jeremy Stribling has
> submitted,
> > > and
> > > > I'd
> > > > >> like some feedback from the user base on it.
> > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > > >>
> > > > >> The current behavior of ZK when we get an uncaught exception is to
> > log
> > > > it
> > > > >> and try to move on. This is arguably not the right thing to do,
> and
> > > will
> > > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM
> state)
> > > for
> > > > >> longer than it should.
> > > > >> The patch proposes that when we get an instance of
> java.lang.Error,
> > we
> > > > >> should do a system.exit to fast-fail the process. With the
> possible
> > > > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > > > system
> > > > >> state depending on the thread), I think this makes sense, but I
> > would
> > > > like
> > > > >> to hear from others if they have an opinion. I think it's better
> to
> > > kill
> > > > >> the process and let your monitoring services detect process death
> > (and
> > > > thus
> > > > >> restart) than possibly linger unresponsive for a while, are there
> > > > scenarios
> > > > >> that we're missing where this error can occur and you wouldn't
> want
> > > the
> > > > >> process killed?
> > > > >>
> > > > >> Thanks for your feedback,
> > > > >>
> > > > >> Camille
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >  Vitalii Tymchyshyn
> > > >
> > >
> >
>

Re: Input on a change

Posted by Camille Fournier <ca...@apache.org>.

Sure, mocking it's a helluva lot faster too.

C

>From my phone
On Apr 16, 2012 12:54 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:

> Fair enough, if you think it is an anti-pattern then perhaps you can
> suggest an alternative that allows one to write tests that are (in
> descending order of priority):
>
> 1. Quick and easy to setup and teardown
> 2. Repeatably testable and debuggable in an IDE without having to resort to
> the external build tool
> 3. Testable in parallel, i.e. having more than one build running on a CI
> server, so need some way to avoid port clashes
> 4. Cross-platform - i.e. run the exact same build sequence on multiple
> OSes.
>
> Ishaaq
>
> On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:
>
> > I believe that this change is inspired by someone that runs zk embedded.
> > Personally I'm not moved by the testing argument, embedding the server
> for
> > testing is a bit of an anti pattern in my mind.
> >
> > From my phone
> > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
> >
> > > I'd go so far as to say that even the server-code should avoid
> > System.exit.
> > > Just because it is "meant" to be a standalone system doesn't mean that
> > code
> > > that makes it impossible to embed it should be encouraged.
> > >
> > > For e.g, we embed a local version of ZK to be used inside our unit
> tests.
> > > This makes it much easier for us to control ZK to coincide with test
> > > expectations as well as making for much faster build times. It would
> be a
> > > shame if the embedded ZK started killing the JVM.
> > >
> > > Ishaaq
> > >
> > > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
> > >
> > > > This is a good point.
> > > > I think this change should be fine for the server portion of the
> code,
> > > > since it's designed to be run as a standalone system. But for the
> > > > client connection to also call system.exit on such an error is
> > > > overreaching for all the reasons listed below.
> > > >
> > > > C
> > > >
> > > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > > I really would not like for any library to perform a System.exit
> > call.
> > > > This
> > > > > would make huge program exit out of sudden (think about j2ee, you
> may
> > > be
> > > > > bitten by security manager).  Note that there are more or less safe
> > > > errors,
> > > > > like StackOverflowError.
> > > > > Also System.exit make testing nightmare. E.g. maven2 silently skips
> > any
> > > > > tests after the one that calls System.exit. And everything's green.
> > > > > As for me good options are:
> > > > > 1) Call user-provided uncaught exception handler. Use the one from
> > the
> > > > > thread that created the connection if one is not specified
> explicity.
> > > > > 1) Stop everything, notifying user with a global watcher. If it's
> > > > possible,
> > > > > clean any static state (e.g. restart threads) and allow to restart
> > > > > connection.
> > > > > In any case, call user code. Good system already know how to react
> > (it
> > > > may
> > > > > want to send email to admin), allow it to perform well.
> > > > >
> > > > > Best regards, Vitalii Tymchyshyn.
> > > > >
> > > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I'm trying to evaluate a patch that Jeremy Stribling has
> submitted,
> > > and
> > > > I'd
> > > > >> like some feedback from the user base on it.
> > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > > >>
> > > > >> The current behavior of ZK when we get an uncaught exception is to
> > log
> > > > it
> > > > >> and try to move on. This is arguably not the right thing to do,
> and
> > > will
> > > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM
> state)
> > > for
> > > > >> longer than it should.
> > > > >> The patch proposes that when we get an instance of
> java.lang.Error,
> > we
> > > > >> should do a system.exit to fast-fail the process. With the
> possible
> > > > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > > > system
> > > > >> state depending on the thread), I think this makes sense, but I
> > would
> > > > like
> > > > >> to hear from others if they have an opinion. I think it's better
> to
> > > kill
> > > > >> the process and let your monitoring services detect process death
> > (and
> > > > thus
> > > > >> restart) than possibly linger unresponsive for a while, are there
> > > > scenarios
> > > > >> that we're missing where this error can occur and you wouldn't
> want
> > > the
> > > > >> process killed?
> > > > >>
> > > > >> Thanks for your feedback,
> > > > >>
> > > > >> Camille
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >  Vitalii Tymchyshyn
> > > >
> > >
> >
>

Re: Input on a change

Posted by John Sirois <jo...@gmail.com>.

On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <is...@gmail.com> wrote:

> Fair enough, if you think it is an anti-pattern then perhaps you can
> suggest an alternative that allows one to write tests that are (in
> descending order of priority):
>
> 1. Quick and easy to setup and teardown
> 2. Repeatably testable and debuggable in an IDE without having to resort to
> the external build tool
> 3. Testable in parallel, i.e. having more than one build running on a CI
> server, so need some way to avoid port clashes
> 4. Cross-platform - i.e. run the exact same build sequence on multiple
> OSes.
>

We also embed for testing for all the reasons above with good success.  In
fact - I like the idea of the server product directly supporting alternate
mains - although this may be more burden on zk devs initially.  This would
allow an org to write their own main that plugs in an uncaught handler and
does whatever is appropriate in their deploy environment.


>
> Ishaaq
>
> On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:
>
> > I believe that this change is inspired by someone that runs zk embedded.
> > Personally I'm not moved by the testing argument, embedding the server
> for
> > testing is a bit of an anti pattern in my mind.
> >
> > From my phone
> > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
> >
> > > I'd go so far as to say that even the server-code should avoid
> > System.exit.
> > > Just because it is "meant" to be a standalone system doesn't mean that
> > code
> > > that makes it impossible to embed it should be encouraged.
> > >
> > > For e.g, we embed a local version of ZK to be used inside our unit
> tests.
> > > This makes it much easier for us to control ZK to coincide with test
> > > expectations as well as making for much faster build times. It would
> be a
> > > shame if the embedded ZK started killing the JVM.
> > >
> > > Ishaaq
> > >
> > > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
> > >
> > > > This is a good point.
> > > > I think this change should be fine for the server portion of the
> code,
> > > > since it's designed to be run as a standalone system. But for the
> > > > client connection to also call system.exit on such an error is
> > > > overreaching for all the reasons listed below.
> > > >
> > > > C
> > > >
> > > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > > I really would not like for any library to perform a System.exit
> > call.
> > > > This
> > > > > would make huge program exit out of sudden (think about j2ee, you
> may
> > > be
> > > > > bitten by security manager).  Note that there are more or less safe
> > > > errors,
> > > > > like StackOverflowError.
> > > > > Also System.exit make testing nightmare. E.g. maven2 silently skips
> > any
> > > > > tests after the one that calls System.exit. And everything's green.
> > > > > As for me good options are:
> > > > > 1) Call user-provided uncaught exception handler. Use the one from
> > the
> > > > > thread that created the connection if one is not specified
> explicity.
> > > > > 1) Stop everything, notifying user with a global watcher. If it's
> > > > possible,
> > > > > clean any static state (e.g. restart threads) and allow to restart
> > > > > connection.
> > > > > In any case, call user code. Good system already know how to react
> > (it
> > > > may
> > > > > want to send email to admin), allow it to perform well.
> > > > >
> > > > > Best regards, Vitalii Tymchyshyn.
> > > > >
> > > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I'm trying to evaluate a patch that Jeremy Stribling has
> submitted,
> > > and
> > > > I'd
> > > > >> like some feedback from the user base on it.
> > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > > >>
> > > > >> The current behavior of ZK when we get an uncaught exception is to
> > log
> > > > it
> > > > >> and try to move on. This is arguably not the right thing to do,
> and
> > > will
> > > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM
> state)
> > > for
> > > > >> longer than it should.
> > > > >> The patch proposes that when we get an instance of
> java.lang.Error,
> > we
> > > > >> should do a system.exit to fast-fail the process. With the
> possible
> > > > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > > > system
> > > > >> state depending on the thread), I think this makes sense, but I
> > would
> > > > like
> > > > >> to hear from others if they have an opinion. I think it's better
> to
> > > kill
> > > > >> the process and let your monitoring services detect process death
> > (and
> > > > thus
> > > > >> restart) than possibly linger unresponsive for a while, are there
> > > > scenarios
> > > > >> that we're missing where this error can occur and you wouldn't
> want
> > > the
> > > > >> process killed?
> > > > >>
> > > > >> Thanks for your feedback,
> > > > >>
> > > > >> Camille
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >  Vitalii Tymchyshyn
> > > >
> > >
> >
>

Re: Input on a change

Posted by Scott Fines <sc...@gmail.com>.

For testing, including Michi's suggestion to provide a mechanism to disable
this behavior should be sufficient.

Scott



On Mon, Apr 16, 2012 at 11:52 AM, Ishaaq Chandy <is...@gmail.com> wrote:

> Fair enough, if you think it is an anti-pattern then perhaps you can
> suggest an alternative that allows one to write tests that are (in
> descending order of priority):
>
> 1. Quick and easy to setup and teardown
> 2. Repeatably testable and debuggable in an IDE without having to resort to
> the external build tool
> 3. Testable in parallel, i.e. having more than one build running on a CI
> server, so need some way to avoid port clashes
> 4. Cross-platform - i.e. run the exact same build sequence on multiple
> OSes.
>
> Ishaaq
>
> On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:
>
> > I believe that this change is inspired by someone that runs zk embedded.
> > Personally I'm not moved by the testing argument, embedding the server
> for
> > testing is a bit of an anti pattern in my mind.
> >
> > From my phone
> > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
> >
> > > I'd go so far as to say that even the server-code should avoid
> > System.exit.
> > > Just because it is "meant" to be a standalone system doesn't mean that
> > code
> > > that makes it impossible to embed it should be encouraged.
> > >
> > > For e.g, we embed a local version of ZK to be used inside our unit
> tests.
> > > This makes it much easier for us to control ZK to coincide with test
> > > expectations as well as making for much faster build times. It would
> be a
> > > shame if the embedded ZK started killing the JVM.
> > >
> > > Ishaaq
> > >
> > > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
> > >
> > > > This is a good point.
> > > > I think this change should be fine for the server portion of the
> code,
> > > > since it's designed to be run as a standalone system. But for the
> > > > client connection to also call system.exit on such an error is
> > > > overreaching for all the reasons listed below.
> > > >
> > > > C
> > > >
> > > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > > I really would not like for any library to perform a System.exit
> > call.
> > > > This
> > > > > would make huge program exit out of sudden (think about j2ee, you
> may
> > > be
> > > > > bitten by security manager).  Note that there are more or less safe
> > > > errors,
> > > > > like StackOverflowError.
> > > > > Also System.exit make testing nightmare. E.g. maven2 silently skips
> > any
> > > > > tests after the one that calls System.exit. And everything's green.
> > > > > As for me good options are:
> > > > > 1) Call user-provided uncaught exception handler. Use the one from
> > the
> > > > > thread that created the connection if one is not specified
> explicity.
> > > > > 1) Stop everything, notifying user with a global watcher. If it's
> > > > possible,
> > > > > clean any static state (e.g. restart threads) and allow to restart
> > > > > connection.
> > > > > In any case, call user code. Good system already know how to react
> > (it
> > > > may
> > > > > want to send email to admin), allow it to perform well.
> > > > >
> > > > > Best regards, Vitalii Tymchyshyn.
> > > > >
> > > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I'm trying to evaluate a patch that Jeremy Stribling has
> submitted,
> > > and
> > > > I'd
> > > > >> like some feedback from the user base on it.
> > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > > >>
> > > > >> The current behavior of ZK when we get an uncaught exception is to
> > log
> > > > it
> > > > >> and try to move on. This is arguably not the right thing to do,
> and
> > > will
> > > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM
> state)
> > > for
> > > > >> longer than it should.
> > > > >> The patch proposes that when we get an instance of
> java.lang.Error,
> > we
> > > > >> should do a system.exit to fast-fail the process. With the
> possible
> > > > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > > > system
> > > > >> state depending on the thread), I think this makes sense, but I
> > would
> > > > like
> > > > >> to hear from others if they have an opinion. I think it's better
> to
> > > kill
> > > > >> the process and let your monitoring services detect process death
> > (and
> > > > thus
> > > > >> restart) than possibly linger unresponsive for a while, are there
> > > > scenarios
> > > > >> that we're missing where this error can occur and you wouldn't
> want
> > > the
> > > > >> process killed?
> > > > >>
> > > > >> Thanks for your feedback,
> > > > >>
> > > > >> Camille
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >  Vitalii Tymchyshyn
> > > >
> > >
> >
>

Re: Input on a change

Posted by Camille Fournier <ca...@apache.org>.

Sure, mocking it's a helluva lot faster too.

C

>From my phone
On Apr 16, 2012 12:54 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:

> Fair enough, if you think it is an anti-pattern then perhaps you can
> suggest an alternative that allows one to write tests that are (in
> descending order of priority):
>
> 1. Quick and easy to setup and teardown
> 2. Repeatably testable and debuggable in an IDE without having to resort to
> the external build tool
> 3. Testable in parallel, i.e. having more than one build running on a CI
> server, so need some way to avoid port clashes
> 4. Cross-platform - i.e. run the exact same build sequence on multiple
> OSes.
>
> Ishaaq
>
> On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:
>
> > I believe that this change is inspired by someone that runs zk embedded.
> > Personally I'm not moved by the testing argument, embedding the server
> for
> > testing is a bit of an anti pattern in my mind.
> >
> > From my phone
> > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
> >
> > > I'd go so far as to say that even the server-code should avoid
> > System.exit.
> > > Just because it is "meant" to be a standalone system doesn't mean that
> > code
> > > that makes it impossible to embed it should be encouraged.
> > >
> > > For e.g, we embed a local version of ZK to be used inside our unit
> tests.
> > > This makes it much easier for us to control ZK to coincide with test
> > > expectations as well as making for much faster build times. It would
> be a
> > > shame if the embedded ZK started killing the JVM.
> > >
> > > Ishaaq
> > >
> > > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
> > >
> > > > This is a good point.
> > > > I think this change should be fine for the server portion of the
> code,
> > > > since it's designed to be run as a standalone system. But for the
> > > > client connection to also call system.exit on such an error is
> > > > overreaching for all the reasons listed below.
> > > >
> > > > C
> > > >
> > > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > > I really would not like for any library to perform a System.exit
> > call.
> > > > This
> > > > > would make huge program exit out of sudden (think about j2ee, you
> may
> > > be
> > > > > bitten by security manager).  Note that there are more or less safe
> > > > errors,
> > > > > like StackOverflowError.
> > > > > Also System.exit make testing nightmare. E.g. maven2 silently skips
> > any
> > > > > tests after the one that calls System.exit. And everything's green.
> > > > > As for me good options are:
> > > > > 1) Call user-provided uncaught exception handler. Use the one from
> > the
> > > > > thread that created the connection if one is not specified
> explicity.
> > > > > 1) Stop everything, notifying user with a global watcher. If it's
> > > > possible,
> > > > > clean any static state (e.g. restart threads) and allow to restart
> > > > > connection.
> > > > > In any case, call user code. Good system already know how to react
> > (it
> > > > may
> > > > > want to send email to admin), allow it to perform well.
> > > > >
> > > > > Best regards, Vitalii Tymchyshyn.
> > > > >
> > > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> I'm trying to evaluate a patch that Jeremy Stribling has
> submitted,
> > > and
> > > > I'd
> > > > >> like some feedback from the user base on it.
> > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > > >>
> > > > >> The current behavior of ZK when we get an uncaught exception is to
> > log
> > > > it
> > > > >> and try to move on. This is arguably not the right thing to do,
> and
> > > will
> > > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM
> state)
> > > for
> > > > >> longer than it should.
> > > > >> The patch proposes that when we get an instance of
> java.lang.Error,
> > we
> > > > >> should do a system.exit to fast-fail the process. With the
> possible
> > > > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > > > system
> > > > >> state depending on the thread), I think this makes sense, but I
> > would
> > > > like
> > > > >> to hear from others if they have an opinion. I think it's better
> to
> > > kill
> > > > >> the process and let your monitoring services detect process death
> > (and
> > > > thus
> > > > >> restart) than possibly linger unresponsive for a while, are there
> > > > scenarios
> > > > >> that we're missing where this error can occur and you wouldn't
> want
> > > the
> > > > >> process killed?
> > > > >>
> > > > >> Thanks for your feedback,
> > > > >>
> > > > >> Camille
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >  Vitalii Tymchyshyn
> > > >
> > >
> >
>

Re: Input on a change

Posted by Ishaaq Chandy <is...@gmail.com>.

Fair enough, if you think it is an anti-pattern then perhaps you can
suggest an alternative that allows one to write tests that are (in
descending order of priority):

1. Quick and easy to setup and teardown
2. Repeatably testable and debuggable in an IDE without having to resort to
the external build tool
3. Testable in parallel, i.e. having more than one build running on a CI
server, so need some way to avoid port clashes
4. Cross-platform - i.e. run the exact same build sequence on multiple OSes.

Ishaaq

On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:

> I believe that this change is inspired by someone that runs zk embedded.
> Personally I'm not moved by the testing argument, embedding the server for
> testing is a bit of an anti pattern in my mind.
>
> From my phone
> On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
>
> > I'd go so far as to say that even the server-code should avoid
> System.exit.
> > Just because it is "meant" to be a standalone system doesn't mean that
> code
> > that makes it impossible to embed it should be encouraged.
> >
> > For e.g, we embed a local version of ZK to be used inside our unit tests.
> > This makes it much easier for us to control ZK to coincide with test
> > expectations as well as making for much faster build times. It would be a
> > shame if the embedded ZK started killing the JVM.
> >
> > Ishaaq
> >
> > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
> >
> > > This is a good point.
> > > I think this change should be fine for the server portion of the code,
> > > since it's designed to be run as a standalone system. But for the
> > > client connection to also call system.exit on such an error is
> > > overreaching for all the reasons listed below.
> > >
> > > C
> > >
> > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > I really would not like for any library to perform a System.exit
> call.
> > > This
> > > > would make huge program exit out of sudden (think about j2ee, you may
> > be
> > > > bitten by security manager).  Note that there are more or less safe
> > > errors,
> > > > like StackOverflowError.
> > > > Also System.exit make testing nightmare. E.g. maven2 silently skips
> any
> > > > tests after the one that calls System.exit. And everything's green.
> > > > As for me good options are:
> > > > 1) Call user-provided uncaught exception handler. Use the one from
> the
> > > > thread that created the connection if one is not specified explicity.
> > > > 1) Stop everything, notifying user with a global watcher. If it's
> > > possible,
> > > > clean any static state (e.g. restart threads) and allow to restart
> > > > connection.
> > > > In any case, call user code. Good system already know how to react
> (it
> > > may
> > > > want to send email to admin), allow it to perform well.
> > > >
> > > > Best regards, Vitalii Tymchyshyn.
> > > >
> > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> I'm trying to evaluate a patch that Jeremy Stribling has submitted,
> > and
> > > I'd
> > > >> like some feedback from the user base on it.
> > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > >>
> > > >> The current behavior of ZK when we get an uncaught exception is to
> log
> > > it
> > > >> and try to move on. This is arguably not the right thing to do, and
> > will
> > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
> > for
> > > >> longer than it should.
> > > >> The patch proposes that when we get an instance of java.lang.Error,
> we
> > > >> should do a system.exit to fast-fail the process. With the possible
> > > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > > system
> > > >> state depending on the thread), I think this makes sense, but I
> would
> > > like
> > > >> to hear from others if they have an opinion. I think it's better to
> > kill
> > > >> the process and let your monitoring services detect process death
> (and
> > > thus
> > > >> restart) than possibly linger unresponsive for a while, are there
> > > scenarios
> > > >> that we're missing where this error can occur and you wouldn't want
> > the
> > > >> process killed?
> > > >>
> > > >> Thanks for your feedback,
> > > >>
> > > >> Camille
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >  Vitalii Tymchyshyn
> > >
> >
>

Re: Input on a change

Posted by Ishaaq Chandy <is...@gmail.com>.

Fair enough, if you think it is an anti-pattern then perhaps you can
suggest an alternative that allows one to write tests that are (in
descending order of priority):

1. Quick and easy to setup and teardown
2. Repeatably testable and debuggable in an IDE without having to resort to
the external build tool
3. Testable in parallel, i.e. having more than one build running on a CI
server, so need some way to avoid port clashes
4. Cross-platform - i.e. run the exact same build sequence on multiple OSes.

Ishaaq

On 16 April 2012 22:55, Camille Fournier <ca...@apache.org> wrote:

> I believe that this change is inspired by someone that runs zk embedded.
> Personally I'm not moved by the testing argument, embedding the server for
> testing is a bit of an anti pattern in my mind.
>
> From my phone
> On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:
>
> > I'd go so far as to say that even the server-code should avoid
> System.exit.
> > Just because it is "meant" to be a standalone system doesn't mean that
> code
> > that makes it impossible to embed it should be encouraged.
> >
> > For e.g, we embed a local version of ZK to be used inside our unit tests.
> > This makes it much easier for us to control ZK to coincide with test
> > expectations as well as making for much faster build times. It would be a
> > shame if the embedded ZK started killing the JVM.
> >
> > Ishaaq
> >
> > On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
> >
> > > This is a good point.
> > > I think this change should be fine for the server portion of the code,
> > > since it's designed to be run as a standalone system. But for the
> > > client connection to also call system.exit on such an error is
> > > overreaching for all the reasons listed below.
> > >
> > > C
> > >
> > > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > > I really would not like for any library to perform a System.exit
> call.
> > > This
> > > > would make huge program exit out of sudden (think about j2ee, you may
> > be
> > > > bitten by security manager).  Note that there are more or less safe
> > > errors,
> > > > like StackOverflowError.
> > > > Also System.exit make testing nightmare. E.g. maven2 silently skips
> any
> > > > tests after the one that calls System.exit. And everything's green.
> > > > As for me good options are:
> > > > 1) Call user-provided uncaught exception handler. Use the one from
> the
> > > > thread that created the connection if one is not specified explicity.
> > > > 1) Stop everything, notifying user with a global watcher. If it's
> > > possible,
> > > > clean any static state (e.g. restart threads) and allow to restart
> > > > connection.
> > > > In any case, call user code. Good system already know how to react
> (it
> > > may
> > > > want to send email to admin), allow it to perform well.
> > > >
> > > > Best regards, Vitalii Tymchyshyn.
> > > >
> > > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> I'm trying to evaluate a patch that Jeremy Stribling has submitted,
> > and
> > > I'd
> > > >> like some feedback from the user base on it.
> > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > > >>
> > > >> The current behavior of ZK when we get an uncaught exception is to
> log
> > > it
> > > >> and try to move on. This is arguably not the right thing to do, and
> > will
> > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
> > for
> > > >> longer than it should.
> > > >> The patch proposes that when we get an instance of java.lang.Error,
> we
> > > >> should do a system.exit to fast-fail the process. With the possible
> > > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > > system
> > > >> state depending on the thread), I think this makes sense, but I
> would
> > > like
> > > >> to hear from others if they have an opinion. I think it's better to
> > kill
> > > >> the process and let your monitoring services detect process death
> (and
> > > thus
> > > >> restart) than possibly linger unresponsive for a while, are there
> > > scenarios
> > > >> that we're missing where this error can occur and you wouldn't want
> > the
> > > >> process killed?
> > > >>
> > > >> Thanks for your feedback,
> > > >>
> > > >> Camille
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >  Vitalii Tymchyshyn
> > >
> >
>

Re: Input on a change

Posted by Camille Fournier <ca...@apache.org>.

I believe that this change is inspired by someone that runs zk embedded.
Personally I'm not moved by the testing argument, embedding the server for
testing is a bit of an anti pattern in my mind.

>From my phone
On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:

> I'd go so far as to say that even the server-code should avoid System.exit.
> Just because it is "meant" to be a standalone system doesn't mean that code
> that makes it impossible to embed it should be encouraged.
>
> For e.g, we embed a local version of ZK to be used inside our unit tests.
> This makes it much easier for us to control ZK to coincide with test
> expectations as well as making for much faster build times. It would be a
> shame if the embedded ZK started killing the JVM.
>
> Ishaaq
>
> On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
>
> > This is a good point.
> > I think this change should be fine for the server portion of the code,
> > since it's designed to be run as a standalone system. But for the
> > client connection to also call system.exit on such an error is
> > overreaching for all the reasons listed below.
> >
> > C
> >
> > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > I really would not like for any library to perform a System.exit call.
> > This
> > > would make huge program exit out of sudden (think about j2ee, you may
> be
> > > bitten by security manager).  Note that there are more or less safe
> > errors,
> > > like StackOverflowError.
> > > Also System.exit make testing nightmare. E.g. maven2 silently skips any
> > > tests after the one that calls System.exit. And everything's green.
> > > As for me good options are:
> > > 1) Call user-provided uncaught exception handler. Use the one from the
> > > thread that created the connection if one is not specified explicity.
> > > 1) Stop everything, notifying user with a global watcher. If it's
> > possible,
> > > clean any static state (e.g. restart threads) and allow to restart
> > > connection.
> > > In any case, call user code. Good system already know how to react (it
> > may
> > > want to send email to admin), allow it to perform well.
> > >
> > > Best regards, Vitalii Tymchyshyn.
> > >
> > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > >
> > >> Hi everyone,
> > >>
> > >> I'm trying to evaluate a patch that Jeremy Stribling has submitted,
> and
> > I'd
> > >> like some feedback from the user base on it.
> > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > >>
> > >> The current behavior of ZK when we get an uncaught exception is to log
> > it
> > >> and try to move on. This is arguably not the right thing to do, and
> will
> > >> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
> for
> > >> longer than it should.
> > >> The patch proposes that when we get an instance of java.lang.Error, we
> > >> should do a system.exit to fast-fail the process. With the possible
> > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > system
> > >> state depending on the thread), I think this makes sense, but I would
> > like
> > >> to hear from others if they have an opinion. I think it's better to
> kill
> > >> the process and let your monitoring services detect process death (and
> > thus
> > >> restart) than possibly linger unresponsive for a while, are there
> > scenarios
> > >> that we're missing where this error can occur and you wouldn't want
> the
> > >> process killed?
> > >>
> > >> Thanks for your feedback,
> > >>
> > >> Camille
> > >>
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >  Vitalii Tymchyshyn
> >
>

Re: Input on a change

Posted by Camille Fournier <ca...@apache.org>.

I believe that this change is inspired by someone that runs zk embedded.
Personally I'm not moved by the testing argument, embedding the server for
testing is a bit of an anti pattern in my mind.

>From my phone
On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <is...@gmail.com> wrote:

> I'd go so far as to say that even the server-code should avoid System.exit.
> Just because it is "meant" to be a standalone system doesn't mean that code
> that makes it impossible to embed it should be encouraged.
>
> For e.g, we embed a local version of ZK to be used inside our unit tests.
> This makes it much easier for us to control ZK to coincide with test
> expectations as well as making for much faster build times. It would be a
> shame if the embedded ZK started killing the JVM.
>
> Ishaaq
>
> On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:
>
> > This is a good point.
> > I think this change should be fine for the server portion of the code,
> > since it's designed to be run as a standalone system. But for the
> > client connection to also call system.exit on such an error is
> > overreaching for all the reasons listed below.
> >
> > C
> >
> > 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > > I really would not like for any library to perform a System.exit call.
> > This
> > > would make huge program exit out of sudden (think about j2ee, you may
> be
> > > bitten by security manager).  Note that there are more or less safe
> > errors,
> > > like StackOverflowError.
> > > Also System.exit make testing nightmare. E.g. maven2 silently skips any
> > > tests after the one that calls System.exit. And everything's green.
> > > As for me good options are:
> > > 1) Call user-provided uncaught exception handler. Use the one from the
> > > thread that created the connection if one is not specified explicity.
> > > 1) Stop everything, notifying user with a global watcher. If it's
> > possible,
> > > clean any static state (e.g. restart threads) and allow to restart
> > > connection.
> > > In any case, call user code. Good system already know how to react (it
> > may
> > > want to send email to admin), allow it to perform well.
> > >
> > > Best regards, Vitalii Tymchyshyn.
> > >
> > > 2012/4/13 Camille Fournier <ca...@apache.org>
> > >
> > >> Hi everyone,
> > >>
> > >> I'm trying to evaluate a patch that Jeremy Stribling has submitted,
> and
> > I'd
> > >> like some feedback from the user base on it.
> > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> > >>
> > >> The current behavior of ZK when we get an uncaught exception is to log
> > it
> > >> and try to move on. This is arguably not the right thing to do, and
> will
> > >> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
> for
> > >> longer than it should.
> > >> The patch proposes that when we get an instance of java.lang.Error, we
> > >> should do a system.exit to fast-fail the process. With the possible
> > >> exception of ThreadDeath (which may or may not be an unrecoverable
> > system
> > >> state depending on the thread), I think this makes sense, but I would
> > like
> > >> to hear from others if they have an opinion. I think it's better to
> kill
> > >> the process and let your monitoring services detect process death (and
> > thus
> > >> restart) than possibly linger unresponsive for a while, are there
> > scenarios
> > >> that we're missing where this error can occur and you wouldn't want
> the
> > >> process killed?
> > >>
> > >> Thanks for your feedback,
> > >>
> > >> Camille
> > >>
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >  Vitalii Tymchyshyn
> >
>

Re: Input on a change

Posted by Ishaaq Chandy <is...@gmail.com>.

I'd go so far as to say that even the server-code should avoid System.exit.
Just because it is "meant" to be a standalone system doesn't mean that code
that makes it impossible to embed it should be encouraged.

For e.g, we embed a local version of ZK to be used inside our unit tests.
This makes it much easier for us to control ZK to coincide with test
expectations as well as making for much faster build times. It would be a
shame if the embedded ZK started killing the JVM.

Ishaaq

On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:

> This is a good point.
> I think this change should be fine for the server portion of the code,
> since it's designed to be run as a standalone system. But for the
> client connection to also call system.exit on such an error is
> overreaching for all the reasons listed below.
>
> C
>
> 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > I really would not like for any library to perform a System.exit call.
> This
> > would make huge program exit out of sudden (think about j2ee, you may be
> > bitten by security manager).  Note that there are more or less safe
> errors,
> > like StackOverflowError.
> > Also System.exit make testing nightmare. E.g. maven2 silently skips any
> > tests after the one that calls System.exit. And everything's green.
> > As for me good options are:
> > 1) Call user-provided uncaught exception handler. Use the one from the
> > thread that created the connection if one is not specified explicity.
> > 1) Stop everything, notifying user with a global watcher. If it's
> possible,
> > clean any static state (e.g. restart threads) and allow to restart
> > connection.
> > In any case, call user code. Good system already know how to react (it
> may
> > want to send email to admin), allow it to perform well.
> >
> > Best regards, Vitalii Tymchyshyn.
> >
> > 2012/4/13 Camille Fournier <ca...@apache.org>
> >
> >> Hi everyone,
> >>
> >> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
> I'd
> >> like some feedback from the user base on it.
> >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> >>
> >> The current behavior of ZK when we get an uncaught exception is to log
> it
> >> and try to move on. This is arguably not the right thing to do, and will
> >> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
> >> longer than it should.
> >> The patch proposes that when we get an instance of java.lang.Error, we
> >> should do a system.exit to fast-fail the process. With the possible
> >> exception of ThreadDeath (which may or may not be an unrecoverable
> system
> >> state depending on the thread), I think this makes sense, but I would
> like
> >> to hear from others if they have an opinion. I think it's better to kill
> >> the process and let your monitoring services detect process death (and
> thus
> >> restart) than possibly linger unresponsive for a while, are there
> scenarios
> >> that we're missing where this error can occur and you wouldn't want the
> >> process killed?
> >>
> >> Thanks for your feedback,
> >>
> >> Camille
> >>
> >
> >
> >
> > --
> > Best regards,
> >  Vitalii Tymchyshyn
>

Re: Input on a change

Posted by Ishaaq Chandy <is...@gmail.com>.

I'd go so far as to say that even the server-code should avoid System.exit.
Just because it is "meant" to be a standalone system doesn't mean that code
that makes it impossible to embed it should be encouraged.

For e.g, we embed a local version of ZK to be used inside our unit tests.
This makes it much easier for us to control ZK to coincide with test
expectations as well as making for much faster build times. It would be a
shame if the embedded ZK started killing the JVM.

Ishaaq

On 16 April 2012 04:28, Camille Fournier <ca...@apache.org> wrote:

> This is a good point.
> I think this change should be fine for the server portion of the code,
> since it's designed to be run as a standalone system. But for the
> client connection to also call system.exit on such an error is
> overreaching for all the reasons listed below.
>
> C
>
> 2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> > I really would not like for any library to perform a System.exit call.
> This
> > would make huge program exit out of sudden (think about j2ee, you may be
> > bitten by security manager).  Note that there are more or less safe
> errors,
> > like StackOverflowError.
> > Also System.exit make testing nightmare. E.g. maven2 silently skips any
> > tests after the one that calls System.exit. And everything's green.
> > As for me good options are:
> > 1) Call user-provided uncaught exception handler. Use the one from the
> > thread that created the connection if one is not specified explicity.
> > 1) Stop everything, notifying user with a global watcher. If it's
> possible,
> > clean any static state (e.g. restart threads) and allow to restart
> > connection.
> > In any case, call user code. Good system already know how to react (it
> may
> > want to send email to admin), allow it to perform well.
> >
> > Best regards, Vitalii Tymchyshyn.
> >
> > 2012/4/13 Camille Fournier <ca...@apache.org>
> >
> >> Hi everyone,
> >>
> >> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
> I'd
> >> like some feedback from the user base on it.
> >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
> >>
> >> The current behavior of ZK when we get an uncaught exception is to log
> it
> >> and try to move on. This is arguably not the right thing to do, and will
> >> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
> >> longer than it should.
> >> The patch proposes that when we get an instance of java.lang.Error, we
> >> should do a system.exit to fast-fail the process. With the possible
> >> exception of ThreadDeath (which may or may not be an unrecoverable
> system
> >> state depending on the thread), I think this makes sense, but I would
> like
> >> to hear from others if they have an opinion. I think it's better to kill
> >> the process and let your monitoring services detect process death (and
> thus
> >> restart) than possibly linger unresponsive for a while, are there
> scenarios
> >> that we're missing where this error can occur and you wouldn't want the
> >> process killed?
> >>
> >> Thanks for your feedback,
> >>
> >> Camille
> >>
> >
> >
> >
> > --
> > Best regards,
> >  Vitalii Tymchyshyn
>

Re: Input on a change

Posted by Camille Fournier <ca...@apache.org>.

This is a good point.
I think this change should be fine for the server portion of the code,
since it's designed to be run as a standalone system. But for the
client connection to also call system.exit on such an error is
overreaching for all the reasons listed below.

C

2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> I really would not like for any library to perform a System.exit call. This
> would make huge program exit out of sudden (think about j2ee, you may be
> bitten by security manager).  Note that there are more or less safe errors,
> like StackOverflowError.
> Also System.exit make testing nightmare. E.g. maven2 silently skips any
> tests after the one that calls System.exit. And everything's green.
> As for me good options are:
> 1) Call user-provided uncaught exception handler. Use the one from the
> thread that created the connection if one is not specified explicity.
> 1) Stop everything, notifying user with a global watcher. If it's possible,
> clean any static state (e.g. restart threads) and allow to restart
> connection.
> In any case, call user code. Good system already know how to react (it may
> want to send email to admin), allow it to perform well.
>
> Best regards, Vitalii Tymchyshyn.
>
> 2012/4/13 Camille Fournier <ca...@apache.org>
>
>> Hi everyone,
>>
>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
>> like some feedback from the user base on it.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>
>> The current behavior of ZK when we get an uncaught exception is to log it
>> and try to move on. This is arguably not the right thing to do, and will
>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>> longer than it should.
>> The patch proposes that when we get an instance of java.lang.Error, we
>> should do a system.exit to fast-fail the process. With the possible
>> exception of ThreadDeath (which may or may not be an unrecoverable system
>> state depending on the thread), I think this makes sense, but I would like
>> to hear from others if they have an opinion. I think it's better to kill
>> the process and let your monitoring services detect process death (and thus
>> restart) than possibly linger unresponsive for a while, are there scenarios
>> that we're missing where this error can occur and you wouldn't want the
>> process killed?
>>
>> Thanks for your feedback,
>>
>> Camille
>>
>
>
>
> --
> Best regards,
>  Vitalii Tymchyshyn

Re: Input on a change

Posted by Camille Fournier <ca...@apache.org>.

This is a good point.
I think this change should be fine for the server portion of the code,
since it's designed to be run as a standalone system. But for the
client connection to also call system.exit on such an error is
overreaching for all the reasons listed below.

C

2012/4/15 Віталій Тимчишин <ti...@gmail.com>:
> I really would not like for any library to perform a System.exit call. This
> would make huge program exit out of sudden (think about j2ee, you may be
> bitten by security manager).  Note that there are more or less safe errors,
> like StackOverflowError.
> Also System.exit make testing nightmare. E.g. maven2 silently skips any
> tests after the one that calls System.exit. And everything's green.
> As for me good options are:
> 1) Call user-provided uncaught exception handler. Use the one from the
> thread that created the connection if one is not specified explicity.
> 1) Stop everything, notifying user with a global watcher. If it's possible,
> clean any static state (e.g. restart threads) and allow to restart
> connection.
> In any case, call user code. Good system already know how to react (it may
> want to send email to admin), allow it to perform well.
>
> Best regards, Vitalii Tymchyshyn.
>
> 2012/4/13 Camille Fournier <ca...@apache.org>
>
>> Hi everyone,
>>
>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
>> like some feedback from the user base on it.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>
>> The current behavior of ZK when we get an uncaught exception is to log it
>> and try to move on. This is arguably not the right thing to do, and will
>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>> longer than it should.
>> The patch proposes that when we get an instance of java.lang.Error, we
>> should do a system.exit to fast-fail the process. With the possible
>> exception of ThreadDeath (which may or may not be an unrecoverable system
>> state depending on the thread), I think this makes sense, but I would like
>> to hear from others if they have an opinion. I think it's better to kill
>> the process and let your monitoring services detect process death (and thus
>> restart) than possibly linger unresponsive for a while, are there scenarios
>> that we're missing where this error can occur and you wouldn't want the
>> process killed?
>>
>> Thanks for your feedback,
>>
>> Camille
>>
>
>
>
> --
> Best regards,
>  Vitalii Tymchyshyn

Re: Input on a change

Posted by Віталій Тимчишин <ti...@gmail.com>.

I really would not like for any library to perform a System.exit call. This
would make huge program exit out of sudden (think about j2ee, you may be
bitten by security manager).  Note that there are more or less safe errors,
like StackOverflowError.
Also System.exit make testing nightmare. E.g. maven2 silently skips any
tests after the one that calls System.exit. And everything's green.
As for me good options are:
1) Call user-provided uncaught exception handler. Use the one from the
thread that created the connection if one is not specified explicity.
1) Stop everything, notifying user with a global watcher. If it's possible,
clean any static state (e.g. restart threads) and allow to restart
connection.
In any case, call user code. Good system already know how to react (it may
want to send email to admin), allow it to perform well.

Best regards, Vitalii Tymchyshyn.

2012/4/13 Camille Fournier <ca...@apache.org>

> Hi everyone,
>
> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
> like some feedback from the user base on it.
> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>
> The current behavior of ZK when we get an uncaught exception is to log it
> and try to move on. This is arguably not the right thing to do, and will
> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
> longer than it should.
> The patch proposes that when we get an instance of java.lang.Error, we
> should do a system.exit to fast-fail the process. With the possible
> exception of ThreadDeath (which may or may not be an unrecoverable system
> state depending on the thread), I think this makes sense, but I would like
> to hear from others if they have an opinion. I think it's better to kill
> the process and let your monitoring services detect process death (and thus
> restart) than possibly linger unresponsive for a while, are there scenarios
> that we're missing where this error can occur and you wouldn't want the
> process killed?
>
> Thanks for your feedback,
>
> Camille
>



-- 
Best regards,
 Vitalii Tymchyshyn

ANN: Curator Mailing List

Posted by Jordan Zimmerman <jz...@netflix.com>.

FYI

There is now a mailing list for Curator users. Subscribe here:
https://listserv.netflix.com/mailman/listinfo/curator-users

-Jordan


P.S. Previous version sent with wrong subject

Re: Input on a change

Posted by Jordan Zimmerman <jz...@netflix.com>.

FYI

There is now a mailing list for Curator users. Subscribe here:
https://listserv.netflix.com/mailman/listinfo/curator-users

-Jordan

Re: Input on a change

Posted by "Shelley, Ryan" <Ry...@disney.com>.

System.exit(int) is developer specific. Traditionally, it should be
non-zero for abnormal operation (so really it should be System.exit(-1)).

Maybe it could just be parameterized entirely. Apply the
UncaughtExceptionHandler only if the option is enabled.

--failfast=true/false (defaults to false, until a future major release)

Ryan

On 4/13/12 2:38 PM, "Michi Mutsuzaki" <mi...@cs.stanford.edu> wrote:

>By the way, grepping the source code, I see we use various exit codes.
>Does anybody know what they all mean?
>
>--Michi
>
>On Fri, Apr 13, 2012 at 2:31 PM, Michi Mutsuzaki <mi...@cs.stanford.edu>
>wrote:
>> Ah I see. I guess log and System.exit(1) is the best we can do then.
>>
>> --Michi
>>
>> On Fri, Apr 13, 2012 at 2:25 PM, Jeremy Stribling <st...@nicira.com>
>>wrote:
>>> On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote:
>>>>
>>>> I agree with both Scott's and Ryan's points. I think it makes to:
>>>>
>>>> 1. Make this behavior configurable (with default being "turned off")
>>>>to
>>>> preserve backward compatibility.
>>>> 2. Re-throw the exception instead of exiting with System.exit(1) so
>>>>that
>>>> users can use flags like -XX:+HeapDumpOnOutOfMemoryError.
>>>
>>>
>>> I don't think re-throwing exceptions from an uncaught exception
>>>handler is
>>> an option:
>>> 
>>>http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExcept
>>>ionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwa
>>>ble%29
>>>
>>> " Any exception thrown by this method will be ignored by the Java
>>>Virtual
>>> Machine."
>>>
>>>
>>>> --Michi
>>>> ________________________________________
>>>> From: Scott Fines [scottfines@gmail.com]
>>>> Sent: Friday, April 13, 2012 8:15 AM
>>>> To: user@zookeeper.apache.org
>>>> Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
>>>> Subject: Re: Input on a change
>>>>
>>>> On some JVMs (the HotSpot for sure, but maybe others too?) there's a
>>>>JVM
>>>> for performing actions on OutOfMemoryErrors
>>>>(-XX:OnOutOfMemoryError="<cmd
>>>> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I
>>>>can't
>>>> remember off the top of my head). Will these triggers still be fired,
>>>>or
>>>> will the catch-all prevent them?
>>>>
>>>> I'm still +1 for the change no matter what, but it's probably a good
>>>>idea
>>>> to mention that in the docs if they don't work.
>>>>
>>>> Scott
>>>>
>>>> On Fri, Apr 13, 2012 at 10:09 AM, Camille
>>>> Fournier<ca...@apache.org>wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted,
>>>>>and
>>>>> I'd
>>>>> like some feedback from the user base on it.
>>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>>>
>>>>> The current behavior of ZK when we get an uncaught exception is to
>>>>>log it
>>>>> and try to move on. This is arguably not the right thing to do, and
>>>>>will
>>>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
>>>>>for
>>>>> longer than it should.
>>>>> The patch proposes that when we get an instance of java.lang.Error,
>>>>>we
>>>>> should do a system.exit to fast-fail the process. With the possible
>>>>> exception of ThreadDeath (which may or may not be an unrecoverable
>>>>>system
>>>>> state depending on the thread), I think this makes sense, but I would
>>>>> like
>>>>> to hear from others if they have an opinion. I think it's better to
>>>>>kill
>>>>> the process and let your monitoring services detect process death
>>>>>(and
>>>>> thus
>>>>> restart) than possibly linger unresponsive for a while, are there
>>>>> scenarios
>>>>> that we're missing where this error can occur and you wouldn't want
>>>>>the
>>>>> process killed?
>>>>>
>>>>> Thanks for your feedback,
>>>>>
>>>>> Camille
>>>>>
>>>

Re: Input on a change

Posted by "Shelley, Ryan" <Ry...@disney.com>.

System.exit(int) is developer specific. Traditionally, it should be
non-zero for abnormal operation (so really it should be System.exit(-1)).

Maybe it could just be parameterized entirely. Apply the
UncaughtExceptionHandler only if the option is enabled.

--failfast=true/false (defaults to false, until a future major release)

Ryan

On 4/13/12 2:38 PM, "Michi Mutsuzaki" <mi...@cs.stanford.edu> wrote:

>By the way, grepping the source code, I see we use various exit codes.
>Does anybody know what they all mean?
>
>--Michi
>
>On Fri, Apr 13, 2012 at 2:31 PM, Michi Mutsuzaki <mi...@cs.stanford.edu>
>wrote:
>> Ah I see. I guess log and System.exit(1) is the best we can do then.
>>
>> --Michi
>>
>> On Fri, Apr 13, 2012 at 2:25 PM, Jeremy Stribling <st...@nicira.com>
>>wrote:
>>> On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote:
>>>>
>>>> I agree with both Scott's and Ryan's points. I think it makes to:
>>>>
>>>> 1. Make this behavior configurable (with default being "turned off")
>>>>to
>>>> preserve backward compatibility.
>>>> 2. Re-throw the exception instead of exiting with System.exit(1) so
>>>>that
>>>> users can use flags like -XX:+HeapDumpOnOutOfMemoryError.
>>>
>>>
>>> I don't think re-throwing exceptions from an uncaught exception
>>>handler is
>>> an option:
>>> 
>>>http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExcept
>>>ionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwa
>>>ble%29
>>>
>>> " Any exception thrown by this method will be ignored by the Java
>>>Virtual
>>> Machine."
>>>
>>>
>>>> --Michi
>>>> ________________________________________
>>>> From: Scott Fines [scottfines@gmail.com]
>>>> Sent: Friday, April 13, 2012 8:15 AM
>>>> To: user@zookeeper.apache.org
>>>> Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
>>>> Subject: Re: Input on a change
>>>>
>>>> On some JVMs (the HotSpot for sure, but maybe others too?) there's a
>>>>JVM
>>>> for performing actions on OutOfMemoryErrors
>>>>(-XX:OnOutOfMemoryError="<cmd
>>>> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I
>>>>can't
>>>> remember off the top of my head). Will these triggers still be fired,
>>>>or
>>>> will the catch-all prevent them?
>>>>
>>>> I'm still +1 for the change no matter what, but it's probably a good
>>>>idea
>>>> to mention that in the docs if they don't work.
>>>>
>>>> Scott
>>>>
>>>> On Fri, Apr 13, 2012 at 10:09 AM, Camille
>>>> Fournier<ca...@apache.org>wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted,
>>>>>and
>>>>> I'd
>>>>> like some feedback from the user base on it.
>>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>>>
>>>>> The current behavior of ZK when we get an uncaught exception is to
>>>>>log it
>>>>> and try to move on. This is arguably not the right thing to do, and
>>>>>will
>>>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
>>>>>for
>>>>> longer than it should.
>>>>> The patch proposes that when we get an instance of java.lang.Error,
>>>>>we
>>>>> should do a system.exit to fast-fail the process. With the possible
>>>>> exception of ThreadDeath (which may or may not be an unrecoverable
>>>>>system
>>>>> state depending on the thread), I think this makes sense, but I would
>>>>> like
>>>>> to hear from others if they have an opinion. I think it's better to
>>>>>kill
>>>>> the process and let your monitoring services detect process death
>>>>>(and
>>>>> thus
>>>>> restart) than possibly linger unresponsive for a while, are there
>>>>> scenarios
>>>>> that we're missing where this error can occur and you wouldn't want
>>>>>the
>>>>> process killed?
>>>>>
>>>>> Thanks for your feedback,
>>>>>
>>>>> Camille
>>>>>
>>>

Re: Input on a change

Posted by Michi Mutsuzaki <mi...@cs.stanford.edu>.

By the way, grepping the source code, I see we use various exit codes.
Does anybody know what they all mean?

--Michi

On Fri, Apr 13, 2012 at 2:31 PM, Michi Mutsuzaki <mi...@cs.stanford.edu> wrote:
> Ah I see. I guess log and System.exit(1) is the best we can do then.
>
> --Michi
>
> On Fri, Apr 13, 2012 at 2:25 PM, Jeremy Stribling <st...@nicira.com> wrote:
>> On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote:
>>>
>>> I agree with both Scott's and Ryan's points. I think it makes to:
>>>
>>> 1. Make this behavior configurable (with default being "turned off") to
>>> preserve backward compatibility.
>>> 2. Re-throw the exception instead of exiting with System.exit(1) so that
>>> users can use flags like -XX:+HeapDumpOnOutOfMemoryError.
>>
>>
>> I don't think re-throwing exceptions from an uncaught exception handler is
>> an option:
>> http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExceptionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwable%29
>>
>> " Any exception thrown by this method will be ignored by the Java Virtual
>> Machine."
>>
>>
>>> --Michi
>>> ________________________________________
>>> From: Scott Fines [scottfines@gmail.com]
>>> Sent: Friday, April 13, 2012 8:15 AM
>>> To: user@zookeeper.apache.org
>>> Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Input on a change
>>>
>>> On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>>> for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>>> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>>> remember off the top of my head). Will these triggers still be fired, or
>>> will the catch-all prevent them?
>>>
>>> I'm still +1 for the change no matter what, but it's probably a good idea
>>> to mention that in the docs if they don't work.
>>>
>>> Scott
>>>
>>> On Fri, Apr 13, 2012 at 10:09 AM, Camille
>>> Fournier<ca...@apache.org>wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>>> I'd
>>>> like some feedback from the user base on it.
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>>
>>>> The current behavior of ZK when we get an uncaught exception is to log it
>>>> and try to move on. This is arguably not the right thing to do, and will
>>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>>>> longer than it should.
>>>> The patch proposes that when we get an instance of java.lang.Error, we
>>>> should do a system.exit to fast-fail the process. With the possible
>>>> exception of ThreadDeath (which may or may not be an unrecoverable system
>>>> state depending on the thread), I think this makes sense, but I would
>>>> like
>>>> to hear from others if they have an opinion. I think it's better to kill
>>>> the process and let your monitoring services detect process death (and
>>>> thus
>>>> restart) than possibly linger unresponsive for a while, are there
>>>> scenarios
>>>> that we're missing where this error can occur and you wouldn't want the
>>>> process killed?
>>>>
>>>> Thanks for your feedback,
>>>>
>>>> Camille
>>>>
>>

Re: Input on a change

Posted by Michi Mutsuzaki <mi...@cs.stanford.edu>.

By the way, grepping the source code, I see we use various exit codes.
Does anybody know what they all mean?

--Michi

On Fri, Apr 13, 2012 at 2:31 PM, Michi Mutsuzaki <mi...@cs.stanford.edu> wrote:
> Ah I see. I guess log and System.exit(1) is the best we can do then.
>
> --Michi
>
> On Fri, Apr 13, 2012 at 2:25 PM, Jeremy Stribling <st...@nicira.com> wrote:
>> On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote:
>>>
>>> I agree with both Scott's and Ryan's points. I think it makes to:
>>>
>>> 1. Make this behavior configurable (with default being "turned off") to
>>> preserve backward compatibility.
>>> 2. Re-throw the exception instead of exiting with System.exit(1) so that
>>> users can use flags like -XX:+HeapDumpOnOutOfMemoryError.
>>
>>
>> I don't think re-throwing exceptions from an uncaught exception handler is
>> an option:
>> http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExceptionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwable%29
>>
>> " Any exception thrown by this method will be ignored by the Java Virtual
>> Machine."
>>
>>
>>> --Michi
>>> ________________________________________
>>> From: Scott Fines [scottfines@gmail.com]
>>> Sent: Friday, April 13, 2012 8:15 AM
>>> To: user@zookeeper.apache.org
>>> Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
>>> Subject: Re: Input on a change
>>>
>>> On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>>> for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>>> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>>> remember off the top of my head). Will these triggers still be fired, or
>>> will the catch-all prevent them?
>>>
>>> I'm still +1 for the change no matter what, but it's probably a good idea
>>> to mention that in the docs if they don't work.
>>>
>>> Scott
>>>
>>> On Fri, Apr 13, 2012 at 10:09 AM, Camille
>>> Fournier<ca...@apache.org>wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>>> I'd
>>>> like some feedback from the user base on it.
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>>
>>>> The current behavior of ZK when we get an uncaught exception is to log it
>>>> and try to move on. This is arguably not the right thing to do, and will
>>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>>>> longer than it should.
>>>> The patch proposes that when we get an instance of java.lang.Error, we
>>>> should do a system.exit to fast-fail the process. With the possible
>>>> exception of ThreadDeath (which may or may not be an unrecoverable system
>>>> state depending on the thread), I think this makes sense, but I would
>>>> like
>>>> to hear from others if they have an opinion. I think it's better to kill
>>>> the process and let your monitoring services detect process death (and
>>>> thus
>>>> restart) than possibly linger unresponsive for a while, are there
>>>> scenarios
>>>> that we're missing where this error can occur and you wouldn't want the
>>>> process killed?
>>>>
>>>> Thanks for your feedback,
>>>>
>>>> Camille
>>>>
>>

Re: Input on a change

Posted by Michi Mutsuzaki <mi...@cs.stanford.edu>.

Ah I see. I guess log and System.exit(1) is the best we can do then.

--Michi

On Fri, Apr 13, 2012 at 2:25 PM, Jeremy Stribling <st...@nicira.com> wrote:
> On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote:
>>
>> I agree with both Scott's and Ryan's points. I think it makes to:
>>
>> 1. Make this behavior configurable (with default being "turned off") to
>> preserve backward compatibility.
>> 2. Re-throw the exception instead of exiting with System.exit(1) so that
>> users can use flags like -XX:+HeapDumpOnOutOfMemoryError.
>
>
> I don't think re-throwing exceptions from an uncaught exception handler is
> an option:
> http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExceptionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwable%29
>
> " Any exception thrown by this method will be ignored by the Java Virtual
> Machine."
>
>
>> --Michi
>> ________________________________________
>> From: Scott Fines [scottfines@gmail.com]
>> Sent: Friday, April 13, 2012 8:15 AM
>> To: user@zookeeper.apache.org
>> Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
>> Subject: Re: Input on a change
>>
>> On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>> for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>> remember off the top of my head). Will these triggers still be fired, or
>> will the catch-all prevent them?
>>
>> I'm still +1 for the change no matter what, but it's probably a good idea
>> to mention that in the docs if they don't work.
>>
>> Scott
>>
>> On Fri, Apr 13, 2012 at 10:09 AM, Camille
>> Fournier<ca...@apache.org>wrote:
>>
>>> Hi everyone,
>>>
>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>> I'd
>>> like some feedback from the user base on it.
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>
>>> The current behavior of ZK when we get an uncaught exception is to log it
>>> and try to move on. This is arguably not the right thing to do, and will
>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>>> longer than it should.
>>> The patch proposes that when we get an instance of java.lang.Error, we
>>> should do a system.exit to fast-fail the process. With the possible
>>> exception of ThreadDeath (which may or may not be an unrecoverable system
>>> state depending on the thread), I think this makes sense, but I would
>>> like
>>> to hear from others if they have an opinion. I think it's better to kill
>>> the process and let your monitoring services detect process death (and
>>> thus
>>> restart) than possibly linger unresponsive for a while, are there
>>> scenarios
>>> that we're missing where this error can occur and you wouldn't want the
>>> process killed?
>>>
>>> Thanks for your feedback,
>>>
>>> Camille
>>>
>

Re: Input on a change

Posted by Michi Mutsuzaki <mi...@cs.stanford.edu>.

Ah I see. I guess log and System.exit(1) is the best we can do then.

--Michi

On Fri, Apr 13, 2012 at 2:25 PM, Jeremy Stribling <st...@nicira.com> wrote:
> On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote:
>>
>> I agree with both Scott's and Ryan's points. I think it makes to:
>>
>> 1. Make this behavior configurable (with default being "turned off") to
>> preserve backward compatibility.
>> 2. Re-throw the exception instead of exiting with System.exit(1) so that
>> users can use flags like -XX:+HeapDumpOnOutOfMemoryError.
>
>
> I don't think re-throwing exceptions from an uncaught exception handler is
> an option:
> http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExceptionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwable%29
>
> " Any exception thrown by this method will be ignored by the Java Virtual
> Machine."
>
>
>> --Michi
>> ________________________________________
>> From: Scott Fines [scottfines@gmail.com]
>> Sent: Friday, April 13, 2012 8:15 AM
>> To: user@zookeeper.apache.org
>> Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
>> Subject: Re: Input on a change
>>
>> On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>> for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>> remember off the top of my head). Will these triggers still be fired, or
>> will the catch-all prevent them?
>>
>> I'm still +1 for the change no matter what, but it's probably a good idea
>> to mention that in the docs if they don't work.
>>
>> Scott
>>
>> On Fri, Apr 13, 2012 at 10:09 AM, Camille
>> Fournier<ca...@apache.org>wrote:
>>
>>> Hi everyone,
>>>
>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>> I'd
>>> like some feedback from the user base on it.
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>
>>> The current behavior of ZK when we get an uncaught exception is to log it
>>> and try to move on. This is arguably not the right thing to do, and will
>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>>> longer than it should.
>>> The patch proposes that when we get an instance of java.lang.Error, we
>>> should do a system.exit to fast-fail the process. With the possible
>>> exception of ThreadDeath (which may or may not be an unrecoverable system
>>> state depending on the thread), I think this makes sense, but I would
>>> like
>>> to hear from others if they have an opinion. I think it's better to kill
>>> the process and let your monitoring services detect process death (and
>>> thus
>>> restart) than possibly linger unresponsive for a while, are there
>>> scenarios
>>> that we're missing where this error can occur and you wouldn't want the
>>> process killed?
>>>
>>> Thanks for your feedback,
>>>
>>> Camille
>>>
>

Re: Input on a change

Posted by Jeremy Stribling <st...@nicira.com>.

On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote:
> I agree with both Scott's and Ryan's points. I think it makes to:
>
> 1. Make this behavior configurable (with default being "turned off") to preserve backward compatibility.
> 2. Re-throw the exception instead of exiting with System.exit(1) so that users can use flags like -XX:+HeapDumpOnOutOfMemoryError.

I don't think re-throwing exceptions from an uncaught exception handler 
is an option:
http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExceptionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwable%29

" Any exception thrown by this method will be ignored by the Java 
Virtual Machine."

> --Michi
> ________________________________________
> From: Scott Fines [scottfines@gmail.com]
> Sent: Friday, April 13, 2012 8:15 AM
> To: user@zookeeper.apache.org
> Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
> Subject: Re: Input on a change
>
> On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
> for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
> remember off the top of my head). Will these triggers still be fired, or
> will the catch-all prevent them?
>
> I'm still +1 for the change no matter what, but it's probably a good idea
> to mention that in the docs if they don't work.
>
> Scott
>
> On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier<ca...@apache.org>wrote:
>
>> Hi everyone,
>>
>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
>> like some feedback from the user base on it.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>
>> The current behavior of ZK when we get an uncaught exception is to log it
>> and try to move on. This is arguably not the right thing to do, and will
>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>> longer than it should.
>> The patch proposes that when we get an instance of java.lang.Error, we
>> should do a system.exit to fast-fail the process. With the possible
>> exception of ThreadDeath (which may or may not be an unrecoverable system
>> state depending on the thread), I think this makes sense, but I would like
>> to hear from others if they have an opinion. I think it's better to kill
>> the process and let your monitoring services detect process death (and thus
>> restart) than possibly linger unresponsive for a while, are there scenarios
>> that we're missing where this error can occur and you wouldn't want the
>> process killed?
>>
>> Thanks for your feedback,
>>
>> Camille
>>

Re: Input on a change

Posted by Jeremy Stribling <st...@nicira.com>.

On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote:
> I agree with both Scott's and Ryan's points. I think it makes to:
>
> 1. Make this behavior configurable (with default being "turned off") to preserve backward compatibility.
> 2. Re-throw the exception instead of exiting with System.exit(1) so that users can use flags like -XX:+HeapDumpOnOutOfMemoryError.

I don't think re-throwing exceptions from an uncaught exception handler 
is an option:
http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExceptionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwable%29

" Any exception thrown by this method will be ignored by the Java 
Virtual Machine."

> --Michi
> ________________________________________
> From: Scott Fines [scottfines@gmail.com]
> Sent: Friday, April 13, 2012 8:15 AM
> To: user@zookeeper.apache.org
> Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
> Subject: Re: Input on a change
>
> On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
> for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
> remember off the top of my head). Will these triggers still be fired, or
> will the catch-all prevent them?
>
> I'm still +1 for the change no matter what, but it's probably a good idea
> to mention that in the docs if they don't work.
>
> Scott
>
> On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier<ca...@apache.org>wrote:
>
>> Hi everyone,
>>
>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
>> like some feedback from the user base on it.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>
>> The current behavior of ZK when we get an uncaught exception is to log it
>> and try to move on. This is arguably not the right thing to do, and will
>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>> longer than it should.
>> The patch proposes that when we get an instance of java.lang.Error, we
>> should do a system.exit to fast-fail the process. With the possible
>> exception of ThreadDeath (which may or may not be an unrecoverable system
>> state depending on the thread), I think this makes sense, but I would like
>> to hear from others if they have an opinion. I think it's better to kill
>> the process and let your monitoring services detect process death (and thus
>> restart) than possibly linger unresponsive for a while, are there scenarios
>> that we're missing where this error can occur and you wouldn't want the
>> process killed?
>>
>> Thanks for your feedback,
>>
>> Camille
>>

RE: Input on a change

Posted by Michi Mutsuzaki <mi...@yahoo-inc.com>.

I agree with both Scott's and Ryan's points. I think it makes to:

1. Make this behavior configurable (with default being "turned off") to preserve backward compatibility. 
2. Re-throw the exception instead of exiting with System.exit(1) so that users can use flags like -XX:+HeapDumpOnOutOfMemoryError.

--Michi
________________________________________
From: Scott Fines [scottfines@gmail.com]
Sent: Friday, April 13, 2012 8:15 AM
To: user@zookeeper.apache.org
Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
Subject: Re: Input on a change

On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
remember off the top of my head). Will these triggers still be fired, or
will the catch-all prevent them?

I'm still +1 for the change no matter what, but it's probably a good idea
to mention that in the docs if they don't work.

Scott

On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier <ca...@apache.org>wrote:

> Hi everyone,
>
> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
> like some feedback from the user base on it.
> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>
> The current behavior of ZK when we get an uncaught exception is to log it
> and try to move on. This is arguably not the right thing to do, and will
> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
> longer than it should.
> The patch proposes that when we get an instance of java.lang.Error, we
> should do a system.exit to fast-fail the process. With the possible
> exception of ThreadDeath (which may or may not be an unrecoverable system
> state depending on the thread), I think this makes sense, but I would like
> to hear from others if they have an opinion. I think it's better to kill
> the process and let your monitoring services detect process death (and thus
> restart) than possibly linger unresponsive for a while, are there scenarios
> that we're missing where this error can occur and you wouldn't want the
> process killed?
>
> Thanks for your feedback,
>
> Camille
>

RE: Input on a change

Posted by Michi Mutsuzaki <mi...@yahoo-inc.com>.

I agree with both Scott's and Ryan's points. I think it makes to:

1. Make this behavior configurable (with default being "turned off") to preserve backward compatibility. 
2. Re-throw the exception instead of exiting with System.exit(1) so that users can use flags like -XX:+HeapDumpOnOutOfMemoryError.

--Michi
________________________________________
From: Scott Fines [scottfines@gmail.com]
Sent: Friday, April 13, 2012 8:15 AM
To: user@zookeeper.apache.org
Cc: zookeeper-user@hadoop.apache.org; zookeeper-dev@hadoop.apache.org
Subject: Re: Input on a change

On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
remember off the top of my head). Will these triggers still be fired, or
will the catch-all prevent them?

I'm still +1 for the change no matter what, but it's probably a good idea
to mention that in the docs if they don't work.

Scott

On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier <ca...@apache.org>wrote:

> Hi everyone,
>
> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
> like some feedback from the user base on it.
> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>
> The current behavior of ZK when we get an uncaught exception is to log it
> and try to move on. This is arguably not the right thing to do, and will
> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
> longer than it should.
> The patch proposes that when we get an instance of java.lang.Error, we
> should do a system.exit to fast-fail the process. With the possible
> exception of ThreadDeath (which may or may not be an unrecoverable system
> state depending on the thread), I think this makes sense, but I would like
> to hear from others if they have an opinion. I think it's better to kill
> the process and let your monitoring services detect process death (and thus
> restart) than possibly linger unresponsive for a while, are there scenarios
> that we're missing where this error can occur and you wouldn't want the
> process killed?
>
> Thanks for your feedback,
>
> Camille
>

Re: Input on a change

Posted by "Shelley, Ryan" <Ry...@disney.com>.

And to go just a bit further with this, would this be considered a
non-backward compatible change to the core expectations of the system
(even if it's patching a "bug")? It changes the "expected" behavior of the
application from an undefined possible zombie state to fail-fast.
Theoretically, that might constitute a version bump (http://semver.org),
unless you parameterize the behavior so that a user, who encounters this
behavior outside of their existing expectations, could turn off the
fail-fast behavior by setting an appropriate flag. I'm not an expert in ZK
whatsoever, but maybe there's some runtime operation that occurs and
generates an unchecked exception and the system is fine with that, but
it's only really experienced at runtime in exceptional cases. If the
system fails-fast, without a way to disable fail-fast, you may have to
release another patch promptly (instead of simply saying "disable
fail-fast for now, we'll fix the larger issue in the next release"). Maybe
that's excessive, but just putting it out there anyway.

On 4/13/12 11:22 AM, "Shelley, Ryan" <Ry...@disney.com> wrote:

>Just my 2 centsŠ is the error code 1 the correct error code to return to
>the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1)
>may be called. It may make sense to either re-use that error code, or use
>a different one (if 1 is already used elsewhere for a different type of
>error, like "Invalid arguments" during start-up, for example).
>
>If the error isn't an OOME, is there any clean-up ZK needs to do to maybe
>inform a cluster it's going down abruptly (maybe to gracefully begin a
>leader re-election if necessary, for example)?
>
>I'm +1 to fail-fast behavior.
>
>Ryan
>
>On 4/13/12 8:15 AM, "Scott Fines" <sc...@gmail.com> wrote:
>
>>On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>>for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>>args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>>remember off the top of my head). Will these triggers still be fired, or
>>will the catch-all prevent them?
>>
>>I'm still +1 for the change no matter what, but it's probably a good idea
>>to mention that in the docs if they don't work.
>>
>>Scott
>>
>>On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier
>><ca...@apache.org>wrote:
>>
>>> Hi everyone,
>>>
>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>>I'd
>>> like some feedback from the user base on it.
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>
>>> The current behavior of ZK when we get an uncaught exception is to log
>>>it
>>> and try to move on. This is arguably not the right thing to do, and
>>>will
>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
>>>for
>>> longer than it should.
>>> The patch proposes that when we get an instance of java.lang.Error, we
>>> should do a system.exit to fast-fail the process. With the possible
>>> exception of ThreadDeath (which may or may not be an unrecoverable
>>>system
>>> state depending on the thread), I think this makes sense, but I would
>>>like
>>> to hear from others if they have an opinion. I think it's better to
>>>kill
>>> the process and let your monitoring services detect process death (and
>>>thus
>>> restart) than possibly linger unresponsive for a while, are there
>>>scenarios
>>> that we're missing where this error can occur and you wouldn't want the
>>> process killed?
>>>
>>> Thanks for your feedback,
>>>
>>> Camille
>>>
>

Re: Input on a change

Posted by "Shelley, Ryan" <Ry...@disney.com>.

And to go just a bit further with this, would this be considered a
non-backward compatible change to the core expectations of the system
(even if it's patching a "bug")? It changes the "expected" behavior of the
application from an undefined possible zombie state to fail-fast.
Theoretically, that might constitute a version bump (http://semver.org),
unless you parameterize the behavior so that a user, who encounters this
behavior outside of their existing expectations, could turn off the
fail-fast behavior by setting an appropriate flag. I'm not an expert in ZK
whatsoever, but maybe there's some runtime operation that occurs and
generates an unchecked exception and the system is fine with that, but
it's only really experienced at runtime in exceptional cases. If the
system fails-fast, without a way to disable fail-fast, you may have to
release another patch promptly (instead of simply saying "disable
fail-fast for now, we'll fix the larger issue in the next release"). Maybe
that's excessive, but just putting it out there anyway.

On 4/13/12 11:22 AM, "Shelley, Ryan" <Ry...@disney.com> wrote:

>Just my 2 centsŠ is the error code 1 the correct error code to return to
>the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1)
>may be called. It may make sense to either re-use that error code, or use
>a different one (if 1 is already used elsewhere for a different type of
>error, like "Invalid arguments" during start-up, for example).
>
>If the error isn't an OOME, is there any clean-up ZK needs to do to maybe
>inform a cluster it's going down abruptly (maybe to gracefully begin a
>leader re-election if necessary, for example)?
>
>I'm +1 to fail-fast behavior.
>
>Ryan
>
>On 4/13/12 8:15 AM, "Scott Fines" <sc...@gmail.com> wrote:
>
>>On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>>for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>>args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>>remember off the top of my head). Will these triggers still be fired, or
>>will the catch-all prevent them?
>>
>>I'm still +1 for the change no matter what, but it's probably a good idea
>>to mention that in the docs if they don't work.
>>
>>Scott
>>
>>On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier
>><ca...@apache.org>wrote:
>>
>>> Hi everyone,
>>>
>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>>I'd
>>> like some feedback from the user base on it.
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>>
>>> The current behavior of ZK when we get an uncaught exception is to log
>>>it
>>> and try to move on. This is arguably not the right thing to do, and
>>>will
>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state)
>>>for
>>> longer than it should.
>>> The patch proposes that when we get an instance of java.lang.Error, we
>>> should do a system.exit to fast-fail the process. With the possible
>>> exception of ThreadDeath (which may or may not be an unrecoverable
>>>system
>>> state depending on the thread), I think this makes sense, but I would
>>>like
>>> to hear from others if they have an opinion. I think it's better to
>>>kill
>>> the process and let your monitoring services detect process death (and
>>>thus
>>> restart) than possibly linger unresponsive for a while, are there
>>>scenarios
>>> that we're missing where this error can occur and you wouldn't want the
>>> process killed?
>>>
>>> Thanks for your feedback,
>>>
>>> Camille
>>>
>

Re: Input on a change

Posted by "Shelley, Ryan" <Ry...@disney.com>.

Just my 2 centsŠ is the error code 1 the correct error code to return to
the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1)
may be called. It may make sense to either re-use that error code, or use
a different one (if 1 is already used elsewhere for a different type of
error, like "Invalid arguments" during start-up, for example).

If the error isn't an OOME, is there any clean-up ZK needs to do to maybe
inform a cluster it's going down abruptly (maybe to gracefully begin a
leader re-election if necessary, for example)?

I'm +1 to fail-fast behavior.

Ryan

On 4/13/12 8:15 AM, "Scott Fines" <sc...@gmail.com> wrote:

>On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>remember off the top of my head). Will these triggers still be fired, or
>will the catch-all prevent them?
>
>I'm still +1 for the change no matter what, but it's probably a good idea
>to mention that in the docs if they don't work.
>
>Scott
>
>On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier
><ca...@apache.org>wrote:
>
>> Hi everyone,
>>
>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>I'd
>> like some feedback from the user base on it.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>
>> The current behavior of ZK when we get an uncaught exception is to log
>>it
>> and try to move on. This is arguably not the right thing to do, and will
>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>> longer than it should.
>> The patch proposes that when we get an instance of java.lang.Error, we
>> should do a system.exit to fast-fail the process. With the possible
>> exception of ThreadDeath (which may or may not be an unrecoverable
>>system
>> state depending on the thread), I think this makes sense, but I would
>>like
>> to hear from others if they have an opinion. I think it's better to kill
>> the process and let your monitoring services detect process death (and
>>thus
>> restart) than possibly linger unresponsive for a while, are there
>>scenarios
>> that we're missing where this error can occur and you wouldn't want the
>> process killed?
>>
>> Thanks for your feedback,
>>
>> Camille
>>

Re: Input on a change

Posted by "Shelley, Ryan" <Ry...@disney.com>.

Just my 2 centsŠ is the error code 1 the correct error code to return to
the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1)
may be called. It may make sense to either re-use that error code, or use
a different one (if 1 is already used elsewhere for a different type of
error, like "Invalid arguments" during start-up, for example).

If the error isn't an OOME, is there any clean-up ZK needs to do to maybe
inform a cluster it's going down abruptly (maybe to gracefully begin a
leader re-election if necessary, for example)?

I'm +1 to fail-fast behavior.

Ryan

On 4/13/12 8:15 AM, "Scott Fines" <sc...@gmail.com> wrote:

>On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
>for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
>args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
>remember off the top of my head). Will these triggers still be fired, or
>will the catch-all prevent them?
>
>I'm still +1 for the change no matter what, but it's probably a good idea
>to mention that in the docs if they don't work.
>
>Scott
>
>On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier
><ca...@apache.org>wrote:
>
>> Hi everyone,
>>
>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and
>>I'd
>> like some feedback from the user base on it.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>>
>> The current behavior of ZK when we get an uncaught exception is to log
>>it
>> and try to move on. This is arguably not the right thing to do, and will
>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
>> longer than it should.
>> The patch proposes that when we get an instance of java.lang.Error, we
>> should do a system.exit to fast-fail the process. With the possible
>> exception of ThreadDeath (which may or may not be an unrecoverable
>>system
>> state depending on the thread), I think this makes sense, but I would
>>like
>> to hear from others if they have an opinion. I think it's better to kill
>> the process and let your monitoring services detect process death (and
>>thus
>> restart) than possibly linger unresponsive for a while, are there
>>scenarios
>> that we're missing where this error can occur and you wouldn't want the
>> process killed?
>>
>> Thanks for your feedback,
>>
>> Camille
>>

Re: Input on a change

Posted by Scott Fines <sc...@gmail.com>.

On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
remember off the top of my head). Will these triggers still be fired, or
will the catch-all prevent them?

I'm still +1 for the change no matter what, but it's probably a good idea
to mention that in the docs if they don't work.

Scott

On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier <ca...@apache.org>wrote:

> Hi everyone,
>
> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
> like some feedback from the user base on it.
> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>
> The current behavior of ZK when we get an uncaught exception is to log it
> and try to move on. This is arguably not the right thing to do, and will
> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
> longer than it should.
> The patch proposes that when we get an instance of java.lang.Error, we
> should do a system.exit to fast-fail the process. With the possible
> exception of ThreadDeath (which may or may not be an unrecoverable system
> state depending on the thread), I think this makes sense, but I would like
> to hear from others if they have an opinion. I think it's better to kill
> the process and let your monitoring services detect process death (and thus
> restart) than possibly linger unresponsive for a while, are there scenarios
> that we're missing where this error can occur and you wouldn't want the
> process killed?
>
> Thanks for your feedback,
>
> Camille
>

Re: Input on a change

Posted by Віталій Тимчишин <ti...@gmail.com>.

I really would not like for any library to perform a System.exit call. This
would make huge program exit out of sudden (think about j2ee, you may be
bitten by security manager).  Note that there are more or less safe errors,
like StackOverflowError.
Also System.exit make testing nightmare. E.g. maven2 silently skips any
tests after the one that calls System.exit. And everything's green.
As for me good options are:
1) Call user-provided uncaught exception handler. Use the one from the
thread that created the connection if one is not specified explicity.
1) Stop everything, notifying user with a global watcher. If it's possible,
clean any static state (e.g. restart threads) and allow to restart
connection.
In any case, call user code. Good system already know how to react (it may
want to send email to admin), allow it to perform well.

Best regards, Vitalii Tymchyshyn.

2012/4/13 Camille Fournier <ca...@apache.org>

> Hi everyone,
>
> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
> like some feedback from the user base on it.
> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>
> The current behavior of ZK when we get an uncaught exception is to log it
> and try to move on. This is arguably not the right thing to do, and will
> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
> longer than it should.
> The patch proposes that when we get an instance of java.lang.Error, we
> should do a system.exit to fast-fail the process. With the possible
> exception of ThreadDeath (which may or may not be an unrecoverable system
> state depending on the thread), I think this makes sense, but I would like
> to hear from others if they have an opinion. I think it's better to kill
> the process and let your monitoring services detect process death (and thus
> restart) than possibly linger unresponsive for a while, are there scenarios
> that we're missing where this error can occur and you wouldn't want the
> process killed?
>
> Thanks for your feedback,
>
> Camille
>



-- 
Best regards,
 Vitalii Tymchyshyn

Re: Input on a change

Posted by Scott Fines <sc...@gmail.com>.

On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM
for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd
args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't
remember off the top of my head). Will these triggers still be fired, or
will the catch-all prevent them?

I'm still +1 for the change no matter what, but it's probably a good idea
to mention that in the docs if they don't work.

Scott

On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier <ca...@apache.org>wrote:

> Hi everyone,
>
> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd
> like some feedback from the user base on it.
> https://issues.apache.org/jira/browse/ZOOKEEPER-1442
>
> The current behavior of ZK when we get an uncaught exception is to log it
> and try to move on. This is arguably not the right thing to do, and will
> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for
> longer than it should.
> The patch proposes that when we get an instance of java.lang.Error, we
> should do a system.exit to fast-fail the process. With the possible
> exception of ThreadDeath (which may or may not be an unrecoverable system
> state depending on the thread), I think this makes sense, but I would like
> to hear from others if they have an opinion. I think it's better to kill
> the process and let your monitoring services detect process death (and thus
> restart) than possibly linger unresponsive for a while, are there scenarios
> that we're missing where this error can occur and you wouldn't want the
> process killed?
>
> Thanks for your feedback,
>
> Camille
>