
Posted to dev@ignite.apache.org by Andrey Gura <ag...@apache.org> on 2018/03/12 09:32:02 UTC

IEP-14: Ignite failures handling (Discussion)

Igniters!

We are working on the proposal described in IEP-14 Ignite failures
handling [1], and it's time to discuss it with the community (although
this should have been done earlier).

The most important question: what should the default behaviour be in case
of failure? There are 4 possible actions:

1. Restart the JVM process (possible only if the process was started from
the ignite.(sh|bat) script);
2. Terminate the JVM;
3. Stop the node (if there is only one node in the process, the process
will also be terminated);
4. No operation.

I believe that the node should be stopped by default. But there is a chance
that the node will not be stopped correctly.

Maybe we should terminate the JVM process by default. But that will kill
all nodes in the JVM process, which is especially bad when the nodes
belong to different Ignite clusters (a real use case).

Maybe we should restart the JVM process by default. This approach has the
same problems as the previous one. Additionally, it could lead to
continuous restarts and, therefore, continuous exchanges and
rebalancing.

It's a difficult choice. Could you please share your thoughts?

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Tue, Mar 13, 2018 at 11:17 PM, Nick Pordash <ni...@gmail.com>
wrote:

> I can tell you as a user that if any library I was using in my application
> called System.exit without my consent, it would result in a lot of frustration.
>
> If ignite enters an unrecoverable state then I think that is something that
> should be observable locally, similar to node segmentation and then the
> application can decide the best course of action.
>

Nick, you would be a lot more frustrated if Ignite was frozen and every
call to Ignite would freeze the application threads as well. Again, if you
prefer to keep the process around, even if Ignite freezes, then you can
always configure this behavior, but I still believe that the default should
be to kill the process.

Ignite is a horizontally scalable system, so killing one node should not
be a significant event and should not disrupt the cluster. However, a
freeze of one node is a significant event and can bring the whole cluster
to a halt.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Nick Pordash <ni...@gmail.com>.

I can tell you as a user that if any library I was using in my application
called System.exit without my consent, it would result in a lot of frustration.

If Ignite enters an unrecoverable state, then I think that is something that
should be observable locally, similar to node segmentation, and then the
application can decide the best course of action.

Of course, if Ignite was started as a standalone process, do what you think
is best, but don't think you can kill the process without backlash from
users if it's running in embedded mode.

- Nick

On Tue, Mar 13, 2018, 5:12 PM Valentin Kulichenko <
valentin.kulichenko@gmail.com> wrote:

> Ivan,
>
> If grid hangs, graceful shutdown would most likely hang as well. You can
> almost never recover from a bad state using graceful procedures.
>
> I agree that we should not create two defaults, especially in this case.
> It's not even strictly defined what an embedded node is in Ignite. For
> example, if I start it using a custom main class and/or custom script
> instead of ignite.sh, would it be an embedded or a standalone node?
>
> -Val
>
> On Tue, Mar 13, 2018 at 4:58 PM, Ivan Rakov <iv...@gmail.com> wrote:
>
> > One more note: "kill if standalone, stop if embedded" differs from what
> > you are suggesting, "try graceful, then kill process regardless", only in
> > the case when graceful shutdown hangs.
> > Do we have an understanding of how often graceful shutdown hangs?
> > Obviously, a *grid hang* is a frequent case, but it shouldn't be confused
> > with a *graceful shutdown hang*. From my experience, if something went
> > wrong, users just prefer to do kill -9 because it's much more reliable and
> > easy. Probably, in most cases when kill -9 worked, graceful stop would
> > have worked as well - we just don't have such statistics.
> > It may be a bad example, but: in our CI tests we intentionally break the
> > grid in many harsh ways and perform a graceful stop after the test
> > execution, and it doesn't hang - otherwise we'd see many "Execution
> > timeout" test suite hangs.
> >
> > Best Regards,
> > Ivan Rakov
> >
> >
> > On 14.03.2018 2:24, Dmitriy Setrakyan wrote:
> >
> >> On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov <iv...@gmail.com>
> >> wrote:
> >>
> >>> I just would like to add my +1 for the "kill if standalone, stop if
> >>> embedded" default option. My arguments:
> >>>
> >>> 1) Regarding "If Ignite hangs - it will likely be impossible to stop":
> >>> Unfortunately, it's true that Ignite can hang during the stop procedure.
> >>> However, most of the failures described under IEP-14 (storage IO
> >>> exceptions, death of a critical system worker thread, etc.) normally
> >>> shouldn't turn a node into an "impossible to stop" state. Turning into
> >>> that state is a bug itself. I guess that we shouldn't choose system
> >>> behavior on the basis of known bugs.
> >>>
> >>
> >> The whole discussion is about protecting against force majeure issues,
> >> including Ignite bugs. You are assuming that a user application will
> >> somehow continue to function if an Ignite node is stopped. In most cases
> >> it will just freeze itself and cause the rest of the application to hang.
> >>
> >> Again, "kill+stop" is the most deterministic and the safest default
> >> behavior. Try a graceful shutdown (which will make restart easier), and
> >> then kill the process regardless.
> >>
> >> Note that we are arguing about the default behavior. If a user does not
> >> like this default, then this user can change it to another behavior.
> >>
> >>
> >>> 2) User might want to handle Ignite node crash before shutting down the
> >>> whole JVM - raise alert, close external resources, etc.
> >>>
> >>
> >> Very unlikely, but if a user is this advanced, then this user can change
> >> the default behavior. Most users will not even know how to configure such
> >> custom shutdown behavior and would prefer an automatic kill.
> >>
> >>> 3) IEP-14 document has important notes: "More than one Ignite node could
> >>> be started in one JVM process" and "Different nodes in one JVM process
> >>> could belong to different clusters". This is possible only in embedded
> >>> mode. I think we shouldn't shock the user with a sudden JVM halt
> >>> (possibly along with other healthy nodes) if there's a chance of a
> >>> successful node stop.
> >>>
> >>
> >> Has anyone actually seen a real example of that? I have not. This
> >> scenario is extremely unlikely and should not define the default
> >> behavior. Again, if a user is so advanced as to come up with such a
> >> sophisticated deployment, then the same user should be able to set
> >> different default behaviors for different clusters.
> >>
> >>
> >
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Thanks Andrey! I have added a few comments to the IEP-14 page.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Mon, Mar 19, 2018 at 2:24 PM, Yakov Zhdanov <yz...@apache.org> wrote:

> Andrey Gura,
>
> Why should we have any FailureHandler abstraction? We already have it -
> this is EventListener. In my view it is better (and a cleaner design) to add
> events (similar to, for
> example, org.apache.ignite.events.EventType#EVT_NODE_SEGMENTED) like
> EVT_IGNITE_OOME, EVT_SYS_WORKER_FAILED and fire events according to the
> situation + execute configured system logic. We do exactly the same with
> segmentation: we have a policy which defines how the system reacts, and we
> also allow the user to add event listeners.
>

Yakov, how would it be possible to fire the events if Ignite is not in
an operational state? For example, what can a user do if the Java
application has run out of memory?

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Yakov Zhdanov <yz...@apache.org>.

Andrey Gura,

Why should we have any FailureHandler abstraction? We already have it -
this is EventListener. In my view it is better (and a cleaner design) to add
events (similar to, for
example, org.apache.ignite.events.EventType#EVT_NODE_SEGMENTED) like
EVT_IGNITE_OOME, EVT_SYS_WORKER_FAILED and fire events according to the
situation + execute configured system logic. We do exactly the same with
segmentation: we have a policy which defines how the system reacts, and we
also allow the user to add event listeners.

For a better understanding, please take a look
at org.apache.ignite.plugin.segmentation.SegmentationPolicy
and org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.DiscoveryWorker#onSegmentation.
The discovery manager records the event (allowing the user to get a
notification about it) and executes internal logic if the segmentation
policy is not NOOP.
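
As a rough illustration of this flow (record the event for listeners, then
run the configured policy unless it is a no-op), a minimal sketch might look
like the following. It does not use the real Ignite event or segmentation
classes: FailureEventDispatcher, FailurePolicy and the stop callback are
hypothetical names invented for the example, and only the two event names
come from the message above.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Event names taken from the message above; the enum itself is hypothetical. */
enum FailureEventType { EVT_IGNITE_OOME, EVT_SYS_WORKER_FAILED }

/** Hypothetical policy, loosely modelled on the segmentation policy idea. */
enum FailurePolicy { NOOP, STOP }

/** Notifies user listeners first, then applies the configured policy. */
class FailureEventDispatcher {
    private final List<Consumer<FailureEventType>> listeners = new CopyOnWriteArrayList<>();
    private final FailurePolicy policy;
    private final Runnable stopAction; // what "STOP" means is left to the caller

    FailureEventDispatcher(FailurePolicy policy, Runnable stopAction) {
        this.policy = policy;
        this.stopAction = stopAction;
    }

    void addListener(Consumer<FailureEventType> listener) {
        listeners.add(listener);
    }

    void fire(FailureEventType type) {
        for (Consumer<FailureEventType> listener : listeners) {
            try {
                listener.accept(type);
            } catch (RuntimeException e) {
                // A faulty listener must not prevent the policy from running.
            }
        }
        if (policy != FailurePolicy.NOOP)
            stopAction.run();
    }
}
```

Whether listeners can still be invoked reliably after a failure such as an
out-of-memory error is a separate question that this sketch does not address.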

Thanks!

--Yakov

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Andrey Gura <ag...@apache.org>.

Hi!

Thank you all for your opinions and ideas!

While reading the thread I came to two important conclusions:

1. The proposed API should be changed, because an enumeration of possible
actions is a bad idea. A cleaner and simpler design should allow the user
to provide a failure handler implementation with custom failure-handling
logic if needed.

2. Several failure handler implementations should be provided out of the
box, to give a simple way of changing the default behaviour
through configuration. The following implementations should be
provided:

     - NoOpFailureHandler - Useful for tests and debugging.
     - RestartProcessFailureHandler - A specific implementation that
can be used only with ignite.(sh|bat).
     - StopNodeFailureHandler - Stops the Ignite node in case of a
critical error.
     - StopNodeOrHaltFailureHandler(boolean tryStop, long timeout) -
The default failure handler; it will try to stop the node if tryStop is
true. If the node can't be stopped or tryStop is false, the JVM process
will be terminated forcibly (Runtime.halt()). The default value of the
tryStop parameter is false. Of course, we should limit the node shutdown
time in order to prevent hangs.
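
For illustration, the handler design described above could be sketched
roughly as follows. This is a minimal, self-contained sketch and not the
actual Ignite API: the Node and FailureHandler interfaces, the onFailure
signature, the injectable halt action and the exit code are all assumptions
made for this example. Only the handler names and the (tryStop, timeout)
parameters come from the list above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for an Ignite node; only what this sketch needs. */
interface Node {
    void stop() throws Exception;
}

/** Hypothetical pluggable handler replacing the enumeration of actions. */
interface FailureHandler {
    /** Returns true if the handler took a terminal action (stop or halt). */
    boolean onFailure(Node node, Throwable cause);
}

/** Does nothing; intended for tests and debugging. */
class NoOpFailureHandler implements FailureHandler {
    @Override public boolean onFailure(Node node, Throwable cause) {
        return false;
    }
}

/** Tries to stop the node within a time limit; otherwise halts the JVM. */
class StopNodeOrHaltFailureHandler implements FailureHandler {
    private final boolean tryStop;
    private final long timeoutMs;
    private final Runnable halt; // injectable so the sketch can be exercised safely

    StopNodeOrHaltFailureHandler(boolean tryStop, long timeoutMs) {
        // Exit code 1 is an arbitrary choice for this sketch.
        this(tryStop, timeoutMs, () -> Runtime.getRuntime().halt(1));
    }

    StopNodeOrHaltFailureHandler(boolean tryStop, long timeoutMs, Runnable halt) {
        this.tryStop = tryStop;
        this.timeoutMs = timeoutMs;
        this.halt = halt;
    }

    @Override public boolean onFailure(Node node, Throwable cause) {
        if (!(tryStop && stoppedInTime(node)))
            halt.run(); // stop was not attempted, failed, or timed out
        return true;
    }

    /** Runs node.stop() on a daemon thread so a hung stop cannot block us. */
    private boolean stoppedInTime(Node node) {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<Void> stop = exec.submit(() -> { node.stop(); return null; });
            stop.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException | TimeoutException e) {
            return false;
        } finally {
            exec.shutdownNow();
        }
    }
}
```

Making the halt action a constructor argument is only a convenience for
trying the sketch out without terminating the JVM; the two-argument
constructor falls back to Runtime.halt().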

As for the default behavior, I agree with those who believe that the most
suitable default option is process termination (although I had a
different opinion before); the strongest argument for this choice is
the impossibility of reasoning about system state after a critical
error.
I also believe that we can't choose a solution that will suit every
community member, and the best we can do is provide a simple way
of changing this behavior.

So, I think the default behavior discussion should be finished. I'll
update IEP-14 [1] according to my conclusions above. If you have any
ideas or thoughts about these conclusions, please feel free to share.

Thanks!

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Thu, Mar 15, 2018 at 5:21 AM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> Hi Dmitriy,
>
> It seems that everyone here agrees that killing the process will give a more
> guaranteed result. The problem is that the majority of the community does
> not consider this acceptable when Ignite is started as an embedded
> lib (e.g. from Java, using Ignition.start()).
>
> What can help to accept the community's opinion? Let's remember the Apache
> principle: "community first".
>

I am still confused about the problem the majority of the community is
trying to solve. If our priority is to keep the cluster in a frozen state,
then what is the reason for this task altogether?

The priority should be to keep the cluster operational, not frozen. The
only solution here is "kill" or "stop+kill". If the community does not
accept this option as a default, then I propose to drop this task
altogether, because we do not have to do anything to keep the cluster
frozen.

> If release 2.5 shows us that it was impractical, we will change the default
> to kill even for the library case. What do you think?
>

See above. I do not see a reason to continue with this task if the end
result is identical to what we have today.

I want to give the community another chance to speak up and voice their
opinions again, having fully understood the context and the problem being
solved here.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitry Pavlov <dp...@gmail.com>.

Hi Dmitriy,

It seems that everyone here agrees that killing the process will give a more
guaranteed result. The problem is that the majority of the community does
not consider this acceptable when Ignite is started as an embedded
lib (e.g. from Java, using Ignition.start()).

What can help to accept the community's opinion? Let's remember the Apache
principle: "community first".

If release 2.5 shows us that it was impractical, we will change the default
to kill even for the library case. What do you think?

Sincerely,
Dmitriy Pavlov

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Wed, Mar 14, 2018 at 7:12 PM, Andrey Kornev <an...@hotmail.com>
wrote:

> I'm not disagreeing with you, Dmitriy.
>
> What I'm trying to say is that if we assume that a serious enough bug or
> some environmental issue prevents Ignite node from functioning correctly,
> then it's only logical to assume that Ignite process is completely hosed
> (for example, due to a very very long STW pause) and can't make any
> progress at all. In a situation like this the application can't reason
> about the process state, and the process itself may not be able to even
> kill itself. The only reliable way to handle cases like that is to have an
> external observer (a health monitoring tool) that is not itself affected by
> the bug or the env issue and can either make a decision by itself or send a
> notification to the SRE team.
>

Agree about the external observers, but that is something a user should do
outside of Ignite.

> In my previous post I only suggested going easy on the "cleverness" of the
> self-monitoring implementation, as IMHO it won't be used much in production
> environments. I think Ignite as it is already provides sufficient means
> of monitoring its health (they may or may not be robust enough, which is a
> different issue).
>

The approach I am suggesting is pretty simple - "kill" the process in case
of a critical error. The only intelligence I would like to add is to
attempt shutting down the Ignite node gracefully before the "kill" is
executed. If a node is shut down gracefully, then the restart procedure will
be faster, so it is worthwhile to try.

Some of the critical errors include running out of disk or memory, losing
Ignite system threads, etc. These errors are truly unrecoverable from the
application standpoint and should mostly be handled with a process restart
anyway.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Andrey Kornev <an...@hotmail.com>.

I'm not disagreeing with you, Dmitriy.

What I'm trying to say is that if we assume that a serious enough bug or some environmental issue prevents an Ignite node from functioning correctly, then it's only logical to assume that the Ignite process is completely hosed (for example, due to a very very long STW pause) and can't make any progress at all. In a situation like this the application can't reason about the process state, and the process itself may not even be able to kill itself. The only reliable way to handle cases like that is to have an external observer (a health monitoring tool) that is not itself affected by the bug or the env issue and can either make a decision by itself or send a notification to the SRE team.

In my previous post I only suggested going easy on the "cleverness" of the self-monitoring implementation, as IMHO it won't be used much in production environments. I think Ignite as it is already provides sufficient means of monitoring its health (they may or may not be robust enough, which is a different issue).

Regards
Andrey

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Wed, Mar 14, 2018 at 3:36 PM, Andrey Kornev <an...@hotmail.com>
wrote:

> If I were the one responsible for running Ignite-based applications (be it
> embedded or standalone Ignite) in my company's datacenter, I'd prefer the
> application nodes simply make their current state readily available to
> external tools (via JMX, health checks, etc.) and leave the decision of
> when to die and when to continue to run up to me. The last thing I need in
> production is a too-clever application that decides to kill itself based
> on its local (perhaps confused) state.
>
> Usually SRE teams build all sorts of technology-specific tools to monitor
> health of the applications and they like to be as much in control as
> possible when it comes to killing processes.
>
> I guess what I'm saying is this: keep things simple. Do not over-engineer.
> In real production environments the companies will most likely have this
> feature disabled (I know I would) and instead rely on their own tooling for
> handling failures.
>
>
Andrey, our priority should be to keep the cluster operational. If a frozen
Ignite node is kept around, the whole cluster becomes non-operational. I bet
this is not what you would prefer in production either. However, if we kill
the process, then the cluster should continue to operate.

We are talking about a distributed system in which a failure of one node
should not matter. If we want to keep this promise to the users, then we
must kill the process if an Ignite node freezes.

Also, keep in mind that we are talking about the "default" behavior. If you
are not happy with the "default" mode, then you will be able to configure
other behaviors, like keeping the frozen Ignite node around, if you like.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Andrey Kornev <an...@hotmail.com>.

If I were the one responsible for running Ignite-based applications (be it embedded or standalone Ignite) in my company's datacenter, I'd prefer the application nodes simply make their current state readily available to external tools (via JMX, health checks, etc.) and leave the decision of when to die and when to continue to run up to me. The last thing I need in production is a too-clever application that decides to kill itself based on its local (perhaps confused) state.

Usually SRE teams build all sorts of technology-specific tools to monitor the health of the applications, and they like to be as much in control as possible when it comes to killing processes.

I guess what I'm saying is this: keep things simple. Do not over-engineer. In real production environments the companies will most likely have this feature disabled (I know I would) and instead rely on their own tooling for handling failures.

Regards
Andrey

________________________________
From: Vladimir Ozerov <vo...@gridgain.com>
Sent: Tuesday, March 13, 2018 10:43 PM
To: dev@ignite.apache.org
Subject: Re: IEP-14: Ignite failures handling (Discussion)

As for shutdown, what we need to implement is a “hard shutdown” mode: we
first close all network sockets, then cancel all registered futures. This
would be enough to unblock the cluster and local user threads.
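
A minimal sketch of that ordering (sockets first, then futures) is shown
below, purely as an illustration. The HardShutdown class and its register
methods are hypothetical and do not correspond to any existing Ignite
class; how sockets and futures would actually be tracked inside a node is
left open here.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Future;

/** Hypothetical registry that can release everything a node is blocking on. */
class HardShutdown {
    private final List<Closeable> sockets = new CopyOnWriteArrayList<>();
    private final List<Future<?>> futures = new CopyOnWriteArrayList<>();

    void registerSocket(Closeable socket) {
        sockets.add(socket);
    }

    void registerFuture(Future<?> future) {
        futures.add(future);
    }

    /** Step 1: close network sockets. Step 2: cancel registered futures. */
    void shutdownNow() {
        for (Closeable socket : sockets) {
            try {
                socket.close(); // remote peers see the connection drop
            } catch (IOException ignored) {
                // Best effort: keep going so the remaining resources are released.
            }
        }
        for (Future<?> future : futures)
            future.cancel(true); // local threads waiting in get() are released
    }
}
```

Both java.net.Socket and java.net.ServerSocket implement Closeable, so they
could be registered directly; a thread blocked in Future.get() on a
cancelled future receives a CancellationException instead of waiting
forever.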

On Wed, Mar 14, 2018 at 8:40, Vladimir Ozerov <vo...@gridgain.com> wrote:

> Valya,
>
> This is very easy to answer - if CommandLineStartup is used, then it is
> a standalone node. In all other cases it is embedded.
>
> If node shutdown hangs - just let it continue hanging, so that application
> admins are able to decide on their own what to do next. Someone would want
> to get the stack trace, others would decide to restart outside of business
> hours (e.g. because Ignite is used only in part of their application),
> someone else would try to gracefully shut down other components before
> stopping the process to minimize negative impact, etc.
>
> I don't quite understand why we are guessing here how embedded Ignite is
> used. It could be used in any way and in any combination with other
> frameworks. Process stop by default is simply not an option.
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Vladimir Ozerov <vo...@gridgain.com>.

As for shutdown, what we need to implement is a “hard shutdown” mode. This
is when we first close all network sockets, then cancel all registered
futures. This would be enough to unblock the cluster and local user threads.
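
The two-step order described above (sockets first, then futures) could be sketched roughly as follows. This is only an illustration: HardShutdown and its run method are hypothetical names, not part of the Ignite API, and plain java.io.Closeable and CompletableFuture stand in for whatever Ignite tracks internally.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical helper for illustration only; not an Ignite class. */
final class HardShutdown {
    private HardShutdown() {}

    /**
     * Closes every socket first so that remote nodes stop waiting on this one,
     * then cancels every registered future so that local threads are unblocked.
     * Returns the number of futures that were still pending and got cancelled.
     */
    static int run(List<? extends Closeable> sockets,
                   List<? extends CompletableFuture<?>> futures) {
        for (Closeable s : sockets) {
            try {
                s.close();
            } catch (IOException ignored) {
                // Best effort: one failing socket must not stop the shutdown.
            }
        }

        int cancelled = 0;
        for (CompletableFuture<?> f : futures) {
            // cancel() returns false for futures that had already completed.
            if (f.cancel(true))
                cancelled++;
        }
        return cancelled;
    }
}
```

Threads blocked in get() on a cancelled future receive a CancellationException, which is what would let user code notice the failure instead of waiting forever.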

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Vladimir Ozerov <vo...@gridgain.com>.

Valya,

This is very easy to answer - if CommandLineStartup is used, then it is a
standalone node. In all other cases it is embedded.

If node shutdown hangs - just let it continue hanging, so that application
admins are able to decide on their own what to do next. Someone would want
to get the stack trace, others would decide to restart outside of business
hours (e.g. because Ignite is used only in part of their application),
someone else would try to gracefully shut down other components before
stopping the process to minimize negative impact, etc.

I do not quite understand why we are guessing here how embedded Ignite is used.
It could be used in any way and in any combination with other frameworks.
Process stop by default is simply not an option.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Valentin Kulichenko <va...@gmail.com>.

Ivan,

If the grid hangs, graceful shutdown would most likely hang as well. You can
almost never recover from a bad state using graceful procedures.

I agree that we should not create two defaults, especially in this case.
It's not even strictly defined what an embedded node is in Ignite. For
example, if I start it using a custom main class and/or custom script
instead of ignite.sh, would it be an embedded or a standalone node?

-Val

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Ivan Rakov <iv...@gmail.com>.

One more note: "kill if standalone, stop if embedded" differs from what
you are suggesting ("try graceful, then kill process regardless") only in
the case when graceful shutdown hangs.
Do we understand how often graceful shutdown hangs?
Obviously, a *grid hang* is a frequent case, but it shouldn't be confused with
a *graceful shutdown hang*. From my experience, if something went wrong,
users just prefer to do kill -9 because it's much more reliable and
easy. Probably, in most cases when kill -9 worked, graceful stop
would have worked as well - we just don't have such statistics.
It may be a bad example, but: in our CI tests we intentionally break the grid
in many harsh ways and perform a graceful stop after the test execution,
and it doesn't hang - otherwise we'd see many "Execution timeout" test
suite hangs.

Best Regards,
Ivan Rakov

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov <iv...@gmail.com> wrote:

> I just would like to add my +1 for "kill if standalone, stop if embedded"
> default option. My arguments:
>
> 1) Regarding "If Ignite hangs - it will likely be impossible to stop":
> Unfortunately, it's true that Ignite can hang during stop procedure.
> However, most of failures described under IEP-14 (storage IO exceptions,
> death of critical system worker thread, etc) normally shouldn't turn node
> into "impossible to stop" state. Turning into that state is a bug itself. I
> guess that we shouldn't choose system behavior on the basis of known bugs.

The whole discussion is about protecting against force majeure issues,
including Ignite bugs. You are assuming that a user application will
somehow continue to function if an Ignite node is stopped. In most cases it
will just freeze itself and cause the rest of the application to hang.

Again, "kill+stop" is the most deterministic and the safest default
behavior. Try a graceful shutdown (which will make restart easier), and
then kill the process regardless.
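
A minimal sketch of the "stop+kill" flow described above, under stated assumptions: StopThenKill and handle are hypothetical names, the graceful stop and the kill step are passed in as plain Runnables, and the timeout value is up to the caller. It is not how Ignite implements anything.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not an Ignite class. */
final class StopThenKill {
    private StopThenKill() {}

    /**
     * Tries the graceful stop with a time limit, then runs the kill action
     * regardless of the outcome. Returns true if the graceful stop finished in time.
     */
    static boolean handle(Runnable gracefulStop, Runnable kill, long timeoutMs) {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "graceful-stop");
            t.setDaemon(true); // a hung stop must not keep the JVM alive
            return t;
        });

        boolean stopped = false;
        try {
            Future<?> f = exec.submit(gracefulStop);
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            stopped = true;
        } catch (Exception e) {
            // Timeout, failure inside the stop, or interruption: fall through to kill.
            if (e instanceof InterruptedException)
                Thread.currentThread().interrupt();
        } finally {
            exec.shutdownNow();
            kill.run(); // always executed, whether or not the stop succeeded
        }
        return stopped;
    }
}
```

In a real process the kill action could be something like `() -> Runtime.getRuntime().halt(1)`; keeping it injectable makes the flow testable without terminating the JVM.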

Note that we are arguing about the default behavior. If a user does not
like this default, then this user can change it to another behavior.

> 2) User might want to handle Ignite node crash before shutting down the
> whole JVM - raise alert, close external resources, etc
>

Very unlikely, but if a user is this advanced, then this user can change
the default behavior. Most users will not even know how to configure such
custom shutdown behavior and would prefer an automatic kill.

> 3) IEP-14 document has important notes: "More than one Ignite node could be
> started in one JVM process" and "Different nodes in one JVM process could
> belong to different clusters". This is possible only in embedded mode. I
> think, we shouldn't shock user by sudden JVM halt (possibly, along with
> another healthy nodes) if there's a chance of successful node stop.
>

Has anyone actually seen a real example of that? I have not. This scenario
is extremely unlikely and should not define the default behavior. Again, if
a user is so advanced to come up with such a sophisticated deployment, then
the same user should be able to set different default behaviors for
different clusters.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Ivan Rakov <iv...@gmail.com>.

I would just like to add my +1 for the "kill if standalone, stop if
embedded" default option. My arguments:

1) Regarding "If Ignite hangs - it will likely be impossible to stop":
Unfortunately, it's true that Ignite can hang during the stop procedure.
However, most of the failures described under IEP-14 (storage IO exceptions,
death of a critical system worker thread, etc.) normally shouldn't turn the
node into an "impossible to stop" state. Getting into that state is a bug in
itself. I think that we shouldn't choose system behavior on the basis of
known bugs.

2) A user might want to handle an Ignite node crash before shutting down the
whole JVM - raise an alert, close external resources, etc.

3) The IEP-14 document has important notes: "More than one Ignite node could
be started in one JVM process" and "Different nodes in one JVM process
could belong to different clusters". This is possible only in embedded
mode. I think we shouldn't shock the user with a sudden JVM halt (possibly
along with other healthy nodes) if there's a chance of a successful node
stop.
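
Point 2 could be served by letting the application register hooks that run before the final action. A rough sketch, assuming hypothetical names (FailureNotifier, addHook, onFailure) that do not exist in Ignite:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical hook registry for illustration only; not an Ignite class. */
final class FailureNotifier {
    private final List<Consumer<Throwable>> hooks = new ArrayList<>();

    /** Registers a user hook, e.g. "raise alert" or "close external resources". */
    void addHook(Consumer<Throwable> hook) {
        hooks.add(hook);
    }

    /**
     * Runs all hooks in registration order, then the final action (stop, kill, ...).
     * The final action runs even if a hook throws.
     */
    void onFailure(Throwable cause, Runnable finalAction) {
        try {
            for (Consumer<Throwable> hook : hooks) {
                try {
                    hook.accept(cause);
                } catch (RuntimeException e) {
                    // A broken hook must not prevent the remaining hooks or the final action.
                }
            }
        } finally {
            finalAction.run();
        }
    }
}
```

The sketch is single-threaded and puts no time limit on hooks; a hook that hangs would still block the final action, so a real design would need a timeout around this step as well.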

Best Regards,
Ivan Rakov

On 14.03.2018 1:47, Dmitriy Setrakyan wrote:
> Guys, I do not think there is an understanding here. If Ignite hangs - it
> will likely be impossible to stop. So if you are suggesting "stop if
> embedded", you might as well suggest "do nothing if embedded".
>
> I have seen many Ignite deployments, embedded or not, large and small, and
> in all those deployments if Ignite went into a frozen state, killing it was
> the best option. Moreover, it provided the most predictable behavior. I am
> not guessing here, but it seems to me that the rest of the community is
> guessing.
>
> Killing a frozen Ignite node should be a default behavior in all cases,
> embedded or not. Stopping a frozen Ignite node should be a configurable
> option, so a user has an ability to turn off auto-kill behavior. We should
> also have a 3rd option, "stop+kill", so if stopping fails, then the process
> is automatically killed (this is also a good default option).
>
> Personally, I am OK if the default behavior is "kill" or "stop+kill", but
> it should be the same default in all cases. We should stop the practice of
> creating different default behaviors for the same problem. It is confusing
> and hard to document.
>
> D.
>
> On Tue, Mar 13, 2018 at 2:19 PM, Denis Magda <dm...@apache.org> wrote:
>
>> +1 for "kill if standalone, stop if embedded" behavior. If the practice
>> shows that the node should be killed regardless of the mode, then it will
>> be an easy change. Now we are just guessing, and common sense suggests
>> going for "kill if standalone, stop if embedded" until we get feedback.
>>
>> -
>> Denis
>>
>> On Tue, Mar 13, 2018 at 8:30 AM, Dmitry Pavlov <dp...@gmail.com>
>> wrote:
>>
>>> You are suggesting to kill a process that was not started by Ignite,
>>> aren't you?
>>>
>>> It is more consistent to stop only those processes that are started under
>>> the control of Ignite, e.g. from ignite.sh - that is OK for me.
>>>
>>> If we release 'kill by default' as part of 2.5, we will end up with a 2.6
>>> emergency release to change it back as soon as one user faces such
>>> unexpected behaviour.
>>>
>>> On Tue, Mar 13, 2018 at 18:17, Dmitriy Setrakyan <ds...@apache.org> wrote:
>>>
>>>> Dmitriy,
>>>>
>>>> I think everyone is suggesting that stopping the node will likely be
>>>> impossible if Ignite is frozen. Moreover, it is very likely that all other
>>>> apps are frozen too.
>>>>
>>>> My comments are below...
>>>>
>>>> On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov <dp...@gmail.com>
>>>> wrote:
>>>>
>>>>> Please consider that a user application may use Ignite as an optional cache
>>>>> for some low-priority feature, while the main logic functions well without
>>>>> Ignite. I can say, as an Ignite user in the past, that it is quite a real case.
>>>>
>>>> I have been a part of this project for a while, but I have never seen
>>>> Ignite used as an optional cache. Usually, Ignite is a mandatory part of
>>>> the application, not optional.
>>>>
>>>>
>>>>> The second real case is using several WAR files within one application
>>>>> server, running different logic. Some apps use Ignite, some do not.
>>>>> Killing the application server in this case is not an option either.
>>>>>
>>>> Not very likely, but possible. This is not a common use case. Most commonly
>>>> Ignite would be serving all WAR files with a common data layer.
>>>>
>>>>
>>>>> So the default should be stopping all node threads, but not killing the
>>>>> process. If the user is aware that the process may be killed, they may set
>>>>> the option.
>>>>>
>>>> No, the default should be to kill the process. If the user does not like it,
>>>> then it should be possible to change it to stop the node first.
>>>>
>>>>
>>>>> On Tue, Mar 13, 2018 at 15:24, Dmitriy Setrakyan <dsetrakyan@apache.org> wrote:
>>>>>
>>>>>> On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov <dpavlov.spb@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Dmitriy, the alternative is "kill if standalone, stop if embedded".
>>>>>>>
>>>>>>> The user will still be able to set something like
>>>>>>> -DNODE_CRASH_ACTION="kill"
>>>>>>> if ignite.sh is not used and the user accepts that the whole process
>>>>>>> would be killed if the node crashes.
>>>>>>>
>>>>>>> The default would be 'node stop', but not hanging indefinitely.
>>>>>>>
>>>>>> Dmitriy, if Ignite is frozen, you will not be able to stop it. The only
>>>>>> guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.
>>>>>> On top of that, it is very likely that if you stop the "embedded" Ignite,
>>>>>> the user application will not be able to function anyway, so killing the
>>>>>> node does sound like a better and *safer* option.
>>>>>>
>>>>>> D.
>>>>>>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Nikolay Izhikov <ni...@apache.org>.

Dmitriy,

I think you and the other participants in the discussion are talking about different cases.

Maybe it would be useful to look at specific cases and discuss each of them separately?

I look at the IEP page and see the following:

```
File IO errors. Usually IOException's threw by read/write operations on file system. The following subsystems should be considered as critical:
* WAL
* Page store
* Meta store
* Binary meta store
```

Suppose we run out of disk space on some node.
Everything else is all right.
Should we do `System.exit(-1);` in that case?

Personally, I fully agree with Nick Pordash:

"I can tell you as a user that if any library I was using in my application called System.exit without my consent would result in a lot of frustration."

Also, do you have any examples of other products that do `System.exit(-1);` in case of trouble?

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Tue, Mar 13, 2018 at 6:55 PM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> What do you think if stop is default for all cases?
>
> Kill is configurable.
>
> We can consider enforse sockets close for 'stop'. This will allow to ignore
> hang node by rest of the cluster.
>

Dmitriy, I see that you cannot come to terms with stopping a process that
was not started by Ignite. However, in the majority of deployments, users
would prefer that you "kill" the process instead of leaving it
running in a "frozen" state. A frozen state is non-deterministic and it is
impossible to create a recovery procedure for it. Killing the process is very
deterministic and can be recovered from by restarting it in most cases.

"stop" does not fix the problem we are trying to solve. The whole point is
to prevent frozen state, and "stop" without "kill" does not prevent it. I
am OK if "stop+kill" is the default behavior, which means that we try a
graceful shutdown and then always kill the process anyway.

I think we should have the following configurable options:
- "stop+kill" (default)
- "kill"
- "stop"
- "stop+restart" (if stop fails, we should kill regardless)
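
The four options above could be modeled as a small enum with a parser. This is a sketch only: FailureAction and parse are hypothetical names, and "stop+kill" is treated as the default because that is what this message proposes, not because any release behaves that way.

```java
import java.util.Locale;

/** Hypothetical enumeration of the proposed options; not an Ignite type. */
enum FailureAction {
    STOP_KILL("stop+kill"),
    KILL("kill"),
    STOP("stop"),
    STOP_RESTART("stop+restart");

    private final String label;

    FailureAction(String label) {
        this.label = label;
    }

    String label() {
        return label;
    }

    /** Parses a configured value; null or blank falls back to the proposed default. */
    static FailureAction parse(String value) {
        if (value == null || value.trim().isEmpty())
            return STOP_KILL;

        String v = value.trim().toLowerCase(Locale.ROOT);
        for (FailureAction a : values()) {
            if (a.label.equals(v))
                return a;
        }
        throw new IllegalArgumentException("Unknown failure action: " + value);
    }
}
```

With a system property such as the -DNODE_CRASH_ACTION one floated earlier in the thread, the lookup would be `FailureAction.parse(System.getProperty("NODE_CRASH_ACTION"))`; the property name is only a suggestion from the discussion.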

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitry Pavlov <dp...@gmail.com>.

What do you think about making stop the default for all cases?

Kill would be configurable.

We can consider enforcing socket close for 'stop'. This would allow the rest
of the cluster to ignore the hung node.

On Wed, Mar 14, 2018, 1:48 Dmitriy Setrakyan <ds...@apache.org> wrote:

> Guys, I do not think there is an understanding here. If Ignite hangs - it
> will likely be impossible to stop. So if you are suggesting "stop if
> embedded", you might as well suggest "do nothing if embedded".
>
> I have seen many Ignite deployments, embedded or not, large and small, and
> in all those deployments if Ignite went into a frozen state, killing it was
> the best option. Moreover, it provided the most predictable behavior. I am
> not guessing here, but it seems to me that the rest of the community is
> guessing.
>
> Killing a frozen Ignite node should be a default behavior in all cases,
> embedded or not. Stopping a frozen Ignite node should be a configurable
> option, so a user has an ability to turn off auto-kill behavior. We should
> also have a 3rd option, "stop+kill", so if stopping fails, then the process
> is automatically killed (this is also a good default option).
>
> Personally, I am OK if the default behavior is "kill" or "stop+kill", but
> it should be the same default in all cases. We should stop the practice of
> creating different default behaviors for the same problem. It is confusing
> and hard to document.
>
> D.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Guys, I do not think there is an understanding here. If Ignite hangs - it
will likely be impossible to stop. So if you are suggesting "stop if
embedded", you might as well suggest "do nothing if embedded".

I have seen many Ignite deployments, embedded or not, large and small, and
in all those deployments if Ignite went into a frozen state, killing it was
the best option. Moreover, it provided the most predictable behavior. I am
not guessing here, but it seems to me that the rest of the community is
guessing.

Killing a frozen Ignite node should be a default behavior in all cases,
embedded or not. Stopping a frozen Ignite node should be a configurable
option, so a user has an ability to turn off auto-kill behavior. We should
also have a 3rd option, "stop+kill", so if stopping fails, then the process
is automatically killed (this is also a good default option).

Personally, I am OK if the default behavior is "kill" or "stop+kill", but
it should be the same default in all cases. We should stop the practice of
creating different default behaviors for the same problem. It is confusing
and hard to document.

D.

On Tue, Mar 13, 2018 at 2:19 PM, Denis Magda <dm...@apache.org> wrote:

> +1 for "kill if standalone, stop if embedded" behavior. If the practice
> shows that the node should be killed regardless of the mode, then it will
> be an easy change. Now we are just guessing, and common sense suggests
> going for "kill if standalone, stop if embedded" until we get feedback.
>
> -
> Denis

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Denis Magda <dm...@apache.org>.

+1 for "kill if standalone, stop if embedded" behavior. If the practice
shows that the node should be killed regardless of the mode, then it will
be an easy change. Now we are just guessing, and common sense suggests
going for "kill if standalone, stop if embedded" until we get feedback.
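
As a rough illustration of the "kill if standalone, stop if embedded"
default, the choice could look like the sketch below. Mode, Action,
defaultFor() and effective() are hypothetical names used only for this
example; how a node would detect that it is standalone (for instance, that
it was started from ignite.sh) is deliberately left out.

```java
// Hypothetical sketch: Mode, Action and the method names are not Ignite API.
final class DefaultFailurePolicy {
    enum Mode { STANDALONE, EMBEDDED }
    enum Action { STOP, KILL }

    /** "kill if standalone, stop if embedded". */
    static Action defaultFor(Mode mode) {
        return mode == Mode.STANDALONE ? Action.KILL : Action.STOP;
    }

    /** An explicit user choice (may be null) always wins over the mode-based default. */
    static Action effective(Mode mode, Action userChoice) {
        return userChoice != null ? userChoice : defaultFor(mode);
    }
}
```

Switching later to "kill regardless of the mode" would then be a one-line
change in defaultFor().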

-
Denis

On Tue, Mar 13, 2018 at 8:30 AM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> You are suggesting to kill the process, which was not started by Ignite,
> are not you?
>
> More consistently is to stop only those processes that are generated by the
> control of Ignite, e.g. from ignite.sh - here it is ok for me.
>
> If we release 'kill by default' as part of 2.5, we will end up with 2.6
> emergency release to change it back, if one user will face with such
> unexpected behaviour.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitry Pavlov <dp...@gmail.com>.

You are suggesting killing a process that was not started by Ignite,
aren't you?

It is more consistent to kill only those processes that are started under
Ignite's control, e.g. from ignite.sh - in that case it is OK for me.

If we release 'kill by default' as part of 2.5, we will end up with an
emergency 2.6 release to change it back as soon as one user runs into this
unexpected behaviour.

On Tue, Mar 13, 2018 at 18:17, Dmitriy Setrakyan <ds...@apache.org> wrote:

> Dmitriy,
>
> I think everyone is suggesting that stopping the node will likely be
> impossible if Ignite is frozen. Moreover, it is very likely that all other
> apps are frozen too.
>
> My comments are below...
>
> On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov <dp...@gmail.com>
> wrote:
>
> > Please consider that user application may use Ignite as optional cache
> > for some low-priority feature, but main logic is well functioning without
> > Ignite. I can say, as Ignite user in the past, that it is quite real
> > case.
> >
>
> I have been a part of this project for a while, but I have never seen
> Ignite used as an optional cache. Usually, Ignite is a mandatory part of
> the application, not optional.
>
>
> > Second real case is using several war files within one application
> > server, running different logic. Some apps use Ignite, some applications -
> > not. Killing application server in this case is not an option too.
> >
>
> Not very likely, but possible. This is not a common use case. Most commonly
> Ignite would be serving all WAR files with a common data layer.
>
>
> >
> > So default should be stopping all node threads, but not kill the process.
> > If user is aware process may be killed, it may setup option.
> >
>
> No, the default should be to kill the process. If user does not like it,
> then it should be possible to change it to stop the node first.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Dmitriy,

I think everyone is suggesting that stopping the node will likely be
impossible if Ignite is frozen. Moreover, it is very likely that all other
apps are frozen too.

My comments are below...

On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> Please consider that user application may use Ignite as optional cache for
> some low-priority feature, but main logic is well functioning without
> Ignite. I can say, as Ignite user in the past, that it is quite real case.
>

I have been a part of this project for a while, but I have never seen
Ignite used as an optional cache. Usually, Ignite is a mandatory part of
the application, not optional.

> Second real case is using several war files within one application server,
> running different logic. Some apps use Ignite, some applications - not.
> Killing application server in this case is not an option too.
>

Not very likely, but possible. This is not a common use case. Most commonly
Ignite would be serving all WAR files with a common data layer.

>
> So default should be stopping all node threads, but not kill the process.
> If user is aware process may be killed, it may setup option.
>

No, the default should be to kill the process. If user does not like it,
then it should be possible to change it to stop the node first.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitry Pavlov <dp...@gmail.com>.

Please consider that a user application may use Ignite as an optional cache
for some low-priority feature, while the main logic functions well without
Ignite. I can say, as an Ignite user in the past, that this is quite a real
case.

A second real case is several WAR files within one application server,
running different logic. Some applications use Ignite, some do not.
Killing the application server is not an option in this case either.

So the default should be to stop all node threads, but not to kill the
process. If the user is aware that the process may be killed, they can set
the option.

On Tue, Mar 13, 2018 at 15:24, Dmitriy Setrakyan <ds...@apache.org> wrote:

> On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov <dp...@gmail.com>
> wrote:
>
> > Dmitriy, alternative is "kill if standalone, stop if embedded"
>
>
> > User will be still able to set something like
> > -DNODE_CRASH_ACTION="kill"
> > if ignite.sh is not used and user accepts alternative that whole process
> > would be killed if node is crashed.
> >
> > Default would be 'node stop', but not hang up indefinitely.
> >
>
> Dmitriy, if Ignite is frozen, you will not be able to stop it. The only
> guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.
>
> On top of that, it is very likely that if you stop the "embedded" Ignite,
> the user application will not be able to function anyway, so killing the
> node does sound like a better and *safer* option.
>
> D.
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> Dmitriy, alternative is "kill if standalone, stop if embedded"

> User will be still able to set something like
> -DNODE_CRASH_ACTION="kill"
> if ignite.sh is not used and user accepts alternative that whole process
> would be killed if node is crashed.
>
> Default would be 'node stop', but not hang up indefinitely.
>

Dmitriy, if Ignite is frozen, you will not be able to stop it. The only
guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.

On top of that, it is very likely that if you stop the "embedded" Ignite,
the user application will not be able to function anyway, so killing the
node does sound like a better and *safer* option.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Andrey Kuznetsov <st...@gmail.com>.

The most doubtful thing is 'stopping'. What if the node does not respond
because of a critical failure?

2018-03-13 15:16 GMT+03:00 Dmitry Pavlov <dp...@gmail.com>:

> Dmitriy, alternative is "kill if standalone, stop if embedded"
>
> User will be still able to set something like
> -DNODE_CRASH_ACTION="kill"
> if ignite.sh is not used and user accepts alternative that whole process
> would be killed if node is crashed.
>
> Default would be 'node stop', but not hang up indefinitely.
>
> Sincerely,
> Dmitriy Pavlov
>
> On Tue, Mar 13, 2018 at 14:53, Dmitriy Setrakyan <ds...@apache.org> wrote:
>
> --
Best regards,
  Andrey Kuznetsov.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitry Pavlov <dp...@gmail.com>.

Dmitriy, the alternative is "kill if standalone, stop if embedded".

The user will still be able to set something like
-DNODE_CRASH_ACTION="kill"
if ignite.sh is not used and the user accepts that the whole process
will be killed if a node crashes.

The default would be 'node stop', but without hanging indefinitely.
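
A minimal sketch of how such a switch could be read, assuming it is passed
as a plain JVM system property. NODE_CRASH_ACTION is only the example name
used above, not an existing Ignite property, and the CrashActionProperty
class, its Action enum and the fallback to STOP for missing or unrecognized
values are assumptions made for this illustration.

```java
import java.util.Locale;

// Hypothetical sketch: the property name and this class are not Ignite API.
final class CrashActionProperty {
    static final String PROP = "NODE_CRASH_ACTION";

    enum Action { STOP, KILL }

    /** Maps a raw property value to an action; null, blank or unknown values fall back to STOP. */
    static Action resolve(String raw) {
        if (raw == null || raw.trim().isEmpty())
            return Action.STOP;
        try {
            return Action.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Action.STOP; // unrecognized value: keep the conservative default
        }
    }

    /** Reads the value given on the command line, e.g. -DNODE_CRASH_ACTION=kill. */
    static Action fromSystemProperties() {
        return resolve(System.getProperty(PROP));
    }
}
```

With this shape, an embedded user who accepts that the whole process may be
killed opts in explicitly, and everyone else keeps the 'node stop' default.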

Sincerely,
Dmitriy Pavlov

On Tue, Mar 13, 2018 at 14:53, Dmitriy Setrakyan <ds...@apache.org> wrote:

> Guys, I do not understand the alternative. If Ignite is frozen and causes
> the whole grid to freeze, how can we justify not killing it? Would users
> rather have their applications freeze?
>
> I would consider real life use cases here. Can someone present a life
> example where keeping a frozen grid node around is better than killing JVM?
>
> D.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Guys, I do not understand the alternative. If Ignite is frozen and causes
the whole grid to freeze, how can we justify not killing it? Would users
rather have their applications freeze?

I would consider real-life use cases here. Can someone present a real-life
example where keeping a frozen grid node around is better than killing the
JVM?

D.

On Tue, Mar 13, 2018 at 6:16 AM, Alexey Goncharuk <
alexey.goncharuk@gmail.com> wrote:

> I also like "kill if standalone, stop if embedded" by default. A user can
> change it to kill for embedded mode, but it will be a controlled safe
> choice.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Alexey Goncharuk <al...@gmail.com>.

I also like "kill if standalone, stop if embedded" by default. A user can
change it to kill for embedded mode, but it will be a controlled, safe
choice.

2018-03-13 11:26 GMT+03:00 Vladimir Ozerov <vo...@gridgain.com>:

> +1 for "kill if standalone, stop if embedded". We should never kill a
> process in embedded node because it might be disastrous for user
> application.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Vladimir Ozerov <vo...@gridgain.com>.

+1 for "kill if standalone, stop if embedded". We should never kill a
process in embedded node because it might be disastrous for user
application.

On Tue, Mar 13, 2018 at 10:41 AM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> Denis, Dmitriy, I am not sure I agree here, please see close analogue - JVM
> itself, and its parameter ExitOnOutOfMemoryError,- it is not default.
>
> If server node is started from sh script, kill OK for me, as process is
> controlled only by ignite.  It is sufficient to add option to override
> default for sh script.
>
> Users interested in this behaviour may also setup this option to "kill"
>
> If server node is started from java, it should never kill whole process.
> This mode is not prohibited by docs, users are allowed to start several
> nodes in one process, run its own application logic in this node.
>
> Why we should kill user code running? It could be negative surprise to
> user.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitry Pavlov <dp...@gmail.com>.

Denis, Dmitriy, I am not sure I agree here. Consider a close analogue: the
JVM itself and its ExitOnOutOfMemoryError option, which is not enabled by
default.

If a server node is started from the sh script, killing the process is OK
with me, because the process is controlled only by Ignite. It is sufficient
to add an option that overrides the default for the sh script.

Users interested in this behaviour may also set this option to "kill".

If a server node is started from Java code, it should never kill the whole
process. This mode is not prohibited by the docs: users are allowed to start
several nodes in one process and to run their own application logic in the
same process.

Why should we kill running user code? It could be an unpleasant surprise to
the user.
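
To make the proposal above concrete, here is a minimal sketch of a default
that depends on how the node was launched, with an explicit user override.
All names (LaunchMode, FailureAction, LaunchModeDefaults) are hypothetical
and are not Ignite API. Using "stop node" as the embedded default is an
assumption; the message only says the whole process should not be killed.

```java
import java.util.Optional;

// Hypothetical types for illustration only; not part of Ignite.
enum LaunchMode { STANDALONE_SCRIPT, EMBEDDED }

enum FailureAction { TERMINATE_JVM, STOP_NODE }

final class LaunchModeDefaults {
    private LaunchModeDefaults() {}

    /**
     * Picks the action for a critical failure.
     * A node started from the script owns its JVM, so terminating the JVM is
     * an acceptable default there. An embedded node shares the JVM with user
     * code, so only the node is stopped unless the user opts in to "kill".
     */
    static FailureAction resolve(LaunchMode mode, Optional<FailureAction> userOverride) {
        FailureAction byMode = (mode == LaunchMode.STANDALONE_SCRIPT)
            ? FailureAction.TERMINATE_JVM
            : FailureAction.STOP_NODE;

        // An explicit user setting always wins over the launch-mode default.
        return userOverride.orElse(byMode);
    }
}
```

With this shape, an embedded user who does want the "kill" behaviour passes
Optional.of(FailureAction.TERMINATE_JVM) and nothing else changes.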

On Tue, Mar 13, 2018 at 8:26, Dmitriy Setrakyan <ds...@apache.org> wrote:

> On Tue, Mar 13, 2018 at 1:18 AM, Andrey Kornev <an...@hotmail.com>
> wrote:
>
> > I believe the only reasonable way to handle a critical system failure (as
> > it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!).
> > The sooner - the better, lesser impact. There’s simply no way to reason
> > about the state of the system in a situation like that, all bets are off.
> > Any other policy would only confuse the matters and in all likelihood make
> > things worse.
> >
> > In practice, SREs/Operations would very much rather have a process die a
> > quick clean death, than let it run indefinitely and hope that it’ll somehow
> > recover by itself at some point in future, potentially degrading the
> > overall system stability and availability all the while.
> >
>
> Completely agree.
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Tue, Mar 13, 2018 at 1:18 AM, Andrey Kornev <an...@hotmail.com>
wrote:

> I believe the only reasonable way to handle a critical system failure (as
> it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!).
> The sooner - the better, lesser impact. There’s simply no way to reason
> about the state of the system in a situation like that, all bets are off.
> Any other policy would only confuse the matters and in all likelihood make
> things worse.
>
> In practice, SREs/Operations would very much rather have a process die a
> quick clean death, than let it run indefinitely and hope that it’ll somehow
> recover by itself at some point in future, potentially degrading the
> overall system stability and availability all the while.
>

Completely agree.

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Andrey Kornev <an...@hotmail.com>.

I believe the only reasonable way to handle a critical system failure (as it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!). The sooner the better: the impact is smaller. There’s simply no way to reason about the state of the system in a situation like that; all bets are off. Any other policy would only confuse matters and in all likelihood make things worse.

In practice, SREs/Operations would very much rather have a process die a quick, clean death than let it run indefinitely and hope that it’ll somehow recover by itself at some point in the future, potentially degrading the overall system stability and availability all the while.

Andrey
_____________________________
From: Dmitriy Setrakyan <ds...@apache.org>
Sent: Monday, March 12, 2018 5:23 PM
Subject: Re: IEP-14: Ignite failures handling (Discussion)
To: <de...@ignite.apache.org>


On Mon, Mar 12, 2018 at 5:12 PM, Denis Magda <dm...@apache.org> wrote:

> Dmitriy,
>
> Ignite client node is usually used in the embedded mode. By killing the
> whole process, the node is running in, we're going to kill the entire
> application. That doesn't sound like a good plan. That's why my suggestion
> is to try to kill the node somehow instead rather than the whole process.
>

Agree. However, if the node cannot stop gracefully, we should kill the
process anyway. This should be the default behavior. User should be able to
turn it off as needed.


>
> As for the server nodes, which usually own the whole process, it's totally
> fine to kill the process right away.
>

Well, even here I would still try to gracefully stop the node first. If
that cannot be done, then we should kill the process.


>
> --
> Denis
>
> On Mon, Mar 12, 2018 at 4:12 PM, Dmitriy Setrakyan <ds...@apache.org>
> wrote:
>
> > Denis, what is the difference between killing the process and killing the
> > node and the process?
> >
> > D.
> >
> > On Mon, Mar 12, 2018 at 12:03 PM, Denis Magda <dm...@apache.org> wrote:
> >
> > > Guys,
> > >
> > > I would make a decision depending on a type of the problematic node:
> > >
> > > - If it's a *server node*, then let's kill the process simply
> because
> > > the node usually owns the whole process. Don't see a practical
> reason
> > > why a
> > > user wants to run 2 server nodes in a single process.
> > > - If it's a *client node*, then the best approach is to kill the
> node
> > > and not the process.
> > >
> > > --
> > > Denis
> > >
> > > On Mon, Mar 12, 2018 at 3:04 AM, Dmitry Pavlov <dp...@gmail.com>
> > > wrote:
> > >
> > > > Hi Andrey, Igniters,
> > > >
> > > > Thank you for starting this topic, because this is really important
> > > > decision.
> > > >
> > > > JVM termination in case Ignite is started within application server
> > with
> > > > other application will kill all services started.
> > > >
> > > > So I suggest this option is not default. We can add this option
> > > > (action="JVM termination") as pre-configured for ignite.sh/bat since
> > we
> > > > know is it separate JVM. But I do not vote for the option, if it was
> > the
> > > > default in code.
> > > >
> > > > Sincerely,
> > > > Dmitriy Pavlov
> > > >
> > > > On Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov <st...@gmail.com> wrote:
> > > >
> > > > > To my mind, the default action should be as severe as possible,
> since
> > > we
> > > > > deal with critical errors, that is, entire JVM termination. In the
> > case
> > > > of
> > > > > some custom setup (e.g. different cluster nodes in one JVM) failure
> > > > > response action should be configured explicitly.
> > > > >
> > > > > 2018-03-12 12:32 GMT+03:00 Andrey Gura <ag...@apache.org>:
> > > > >
> > > > > > Igniters!
> > > > > >
> > > > > > We are working on proposal described in IEP-14 Ignite failures
> > > > > > handling [1] and it's time to discuss it with community (although
> > it
> > > > > > was necessary to do this before).
> > > > > >
> > > > > > Most important question: what should be default behaviour in case
> > of
> > > > > > failure? There are 4 actions:
> > > > > >
> > > > > > 1. Restart JVM process (it's possible only if process was started
> > > from
> > > > > > ignite.(sh|bat) script)
> > > > > > 2. Terminate JVM;
> > > > > > 3. Stop node (if there is only one node in process then process
> > will
> > > > > > be also terminated);
> > > > > > 4. No operation.
> > > > > >
> > > > > > I believe that node should be stopped by default. But there is
> > chance
> > > > > > that node will not stopped correctly.
> > > > > >
> > > > > > May be we should terminate JVM process by default. But it will
> kill
> > > > > > all nodes in the JVM process. It's especially bad behaviour in
> case
> > > > > > when nodes belong different Ignite clusters (real use case).
> > > > > >
> > > > > > May be we should restart JVM process default. This approach has
> the
> > > > > > same problems as the previous one. And additionally it could lead
> > to
> > > > > > continues restarts and, therefore, continues exchanges and
> > > > > > rebalancing.
> > > > > >
> > > > > > Difficult choice. Could you please share your thoughts.
> > > > > >
> > > > > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > > > 14+Ignite+failures+handling
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrey Kuznetsov.
> > > > >
> > > >
> > >
> >
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

On Mon, Mar 12, 2018 at 5:12 PM, Denis Magda <dm...@apache.org> wrote:

> Dmitriy,
>
> Ignite client node is usually used in the embedded mode. By killing the
> whole process, the node is running in, we're going to kill the entire
> application. That doesn't sound like a good plan. That's why my suggestion
> is to try to kill the node somehow instead rather than the whole process.
>

Agree. However, if the node cannot stop gracefully, we should kill the
process anyway. This should be the default behavior. Users should be able to
turn it off as needed.


>
> As for the server nodes, which usually own the whole process, it's totally
> fine to kill the process right away.
>

Well, even here I would still try to gracefully stop the node first. If
that cannot be done, then we should kill the process.
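
The "stop gracefully first, kill the process if that fails" policy can be
sketched as follows. This is only an illustration under stated assumptions:
StopThenHalt, the stopNode hook and the timeout are hypothetical names and
are not Ignite API. The halt step is injected so that the logic can be
exercised without ending the JVM; in production it could be
Runtime.getRuntime()::halt, which terminates immediately without running
shutdown hooks.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.IntConsumer;

// Hypothetical helper for illustration only; not part of Ignite.
final class StopThenHalt {
    private static final int HALT_CODE = 1; // arbitrary non-zero exit status

    private final Runnable stopNode;   // assumed hook that stops the node
    private final IntConsumer haltJvm; // e.g. Runtime.getRuntime()::halt

    StopThenHalt(Runnable stopNode, IntConsumer haltJvm) {
        this.stopNode = stopNode;
        this.haltJvm = haltJvm;
    }

    /**
     * Tries to stop the node within the timeout.
     * Returns true if the stop completed; otherwise invokes haltJvm and
     * returns false (with a real halt, the method never returns).
     */
    boolean handle(long timeoutMillis) {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true); // a hung stop must not keep the JVM alive
            return t;
        });
        Future<?> stop = exec.submit(stopNode);
        try {
            stop.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // The stop hung or threw: escalate to killing the process.
            haltJvm.accept(HALT_CODE);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            haltJvm.accept(HALT_CODE);
            return false;
        } finally {
            exec.shutdownNow(); // interrupt a stop attempt that is still running
        }
    }
}
```

Turning the behaviour off, as suggested above, then amounts to passing a
no-op halt step instead of the real one.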


>
> --
> Denis
>
> On Mon, Mar 12, 2018 at 4:12 PM, Dmitriy Setrakyan <ds...@apache.org>
> wrote:
>
> > Denis, what is the difference between killing the process and killing the
> > node and the process?
> >
> > D.
> >
> > On Mon, Mar 12, 2018 at 12:03 PM, Denis Magda <dm...@apache.org> wrote:
> >
> > > Guys,
> > >
> > > I would make a decision depending on a type of the problematic node:
> > >
> > >    - If it's a *server node*, then let's kill the process simply
> because
> > >    the node usually owns the whole process. Don't see a practical
> reason
> > > why a
> > >    user wants to run 2 server nodes in a single process.
> > >    - If it's a *client node*, then the best approach is to kill the
> node
> > >    and not the process.
> > >
> > > --
> > > Denis
> > >
> > > On Mon, Mar 12, 2018 at 3:04 AM, Dmitry Pavlov <dp...@gmail.com>
> > > wrote:
> > >
> > > > Hi Andrey, Igniters,
> > > >
> > > > Thank you for starting this topic, because this is really important
> > > > decision.
> > > >
> > > > JVM termination in case Ignite is started within application server
> > with
> > > > other application will kill all services started.
> > > >
> > > > So I suggest this option is not default. We can add this option
> > > > (action="JVM termination") as pre-configured for ignite.sh/bat since
> > we
> > > > know is it separate JVM. But I do not vote for the option, if it was
> > the
> > > > default in code.
> > > >
> > > > Sincerely,
> > > > Dmitriy Pavlov
> > > >
> > > > On Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov <st...@gmail.com> wrote:
> > > >
> > > > > To my mind, the default action should be as severe as possible,
> since
> > > we
> > > > > deal with critical errors, that is, entire JVM termination. In the
> > case
> > > > of
> > > > > some custom setup (e.g. different cluster nodes in one JVM) failure
> > > > > response action should be configured explicitly.
> > > > >
> > > > > 2018-03-12 12:32 GMT+03:00 Andrey Gura <ag...@apache.org>:
> > > > >
> > > > > > Igniters!
> > > > > >
> > > > > > We are working on proposal described in IEP-14 Ignite failures
> > > > > > handling [1] and it's time to discuss it with community (although
> > it
> > > > > > was necessary to do this before).
> > > > > >
> > > > > > Most important question: what should be default behaviour in case
> > of
> > > > > > failure? There are 4 actions:
> > > > > >
> > > > > > 1. Restart JVM process (it's possible only if process was started
> > > from
> > > > > > ignite.(sh|bat) script)
> > > > > > 2. Terminate JVM;
> > > > > > 3. Stop node (if there is only one node in process then process
> > will
> > > > > > be also terminated);
> > > > > > 4. No operation.
> > > > > >
> > > > > > I believe that node should be stopped by default. But there is
> > chance
> > > > > > that node will not stopped correctly.
> > > > > >
> > > > > > May be we should terminate JVM process by default. But it will
> kill
> > > > > > all nodes in the JVM process. It's especially bad behaviour in
> case
> > > > > > when nodes belong different Ignite clusters (real use case).
> > > > > >
> > > > > > May be we should restart JVM process default. This approach has
> the
> > > > > > same problems as the previous one. And additionally it could lead
> > to
> > > > > > continues restarts and, therefore, continues exchanges and
> > > > > > rebalancing.
> > > > > >
> > > > > > Difficult choice. Could you please share your thoughts.
> > > > > >
> > > > > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > > > 14+Ignite+failures+handling
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >   Andrey Kuznetsov.
> > > > >
> > > >
> > >
> >
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Denis Magda <dm...@apache.org>.

Dmitriy,

An Ignite client node is usually used in embedded mode. By killing the
whole process the node is running in, we would kill the entire
application. That doesn't sound like a good plan. That's why my suggestion
is to try to kill the node rather than the whole process.

As for the server nodes, which usually own the whole process, it's totally
fine to kill the process right away.

--
Denis

On Mon, Mar 12, 2018 at 4:12 PM, Dmitriy Setrakyan <ds...@apache.org>
wrote:

> Denis, what is the difference between killing the process and killing the
> node and the process?
>
> D.
>
> On Mon, Mar 12, 2018 at 12:03 PM, Denis Magda <dm...@apache.org> wrote:
>
> > Guys,
> >
> > I would make a decision depending on a type of the problematic node:
> >
> >    - If it's a *server node*, then let's kill the process simply because
> >    the node usually owns the whole process. Don't see a practical reason
> >    why a user wants to run 2 server nodes in a single process.
> >    - If it's a *client node*, then the best approach is to kill the node
> >    and not the process.
> >
> > --
> > Denis
> >
> > On Mon, Mar 12, 2018 at 3:04 AM, Dmitry Pavlov <dp...@gmail.com>
> > wrote:
> >
> > > Hi Andrey, Igniters,
> > >
> > > Thank you for starting this topic, because this is really important
> > > decision.
> > >
> > > JVM termination in case Ignite is started within application server
> with
> > > other application will kill all services started.
> > >
> > > So I suggest this option is not default. We can add this option
> > > (action="JVM termination") as pre-configured for ignite.sh/bat since
> we
> > > know is it separate JVM. But I do not vote for the option, if it was
> the
> > > default in code.
> > >
> > > Sincerely,
> > > Dmitriy Pavlov
> > >
> > > On Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov <st...@gmail.com> wrote:
> > >
> > > > To my mind, the default action should be as severe as possible, since
> > we
> > > > deal with critical errors, that is, entire JVM termination. In the
> case
> > > of
> > > > some custom setup (e.g. different cluster nodes in one JVM) failure
> > > > response action should be configured explicitly.
> > > >
> > > > 2018-03-12 12:32 GMT+03:00 Andrey Gura <ag...@apache.org>:
> > > >
> > > > > Igniters!
> > > > >
> > > > > We are working on proposal described in IEP-14 Ignite failures
> > > > > handling [1] and it's time to discuss it with community (although
> it
> > > > > was necessary to do this before).
> > > > >
> > > > > Most important question: what should be default behaviour in case
> of
> > > > > failure? There are 4 actions:
> > > > >
> > > > > 1. Restart JVM process (it's possible only if process was started
> > from
> > > > > ignite.(sh|bat) script)
> > > > > 2. Terminate JVM;
> > > > > 3. Stop node (if there is only one node in process then process
> will
> > > > > be also terminated);
> > > > > 4. No operation.
> > > > >
> > > > > I believe that node should be stopped by default. But there is
> chance
> > > > > that node will not stopped correctly.
> > > > >
> > > > > May be we should terminate JVM process by default. But it will kill
> > > > > all nodes in the JVM process. It's especially bad behaviour in case
> > > > > when nodes belong different Ignite clusters (real use case).
> > > > >
> > > > > May be we should restart JVM process default. This approach has the
> > > > > same problems as the previous one. And additionally it could lead
> to
> > > > > continues restarts and, therefore, continues exchanges and
> > > > > rebalancing.
> > > > >
> > > > > Difficult choice. Could you please share your thoughts.
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > > 14+Ignite+failures+handling
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >   Andrey Kuznetsov.
> > > >
> > >
> >
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitriy Setrakyan <ds...@apache.org>.

Denis, what is the difference between killing the process and killing the
node and the process?

D.

On Mon, Mar 12, 2018 at 12:03 PM, Denis Magda <dm...@apache.org> wrote:

> Guys,
>
> I would make a decision depending on a type of the problematic node:
>
>    - If it's a *server node*, then let's kill the process simply because
>    the node usually owns the whole process. Don't see a practical reason
>    why a user wants to run 2 server nodes in a single process.
>    - If it's a *client node*, then the best approach is to kill the node
>    and not the process.
>
> --
> Denis
>
> On Mon, Mar 12, 2018 at 3:04 AM, Dmitry Pavlov <dp...@gmail.com>
> wrote:
>
> > Hi Andrey, Igniters,
> >
> > Thank you for starting this topic, because this is really important
> > decision.
> >
> > JVM termination in case Ignite is started within application server with
> > other application will kill all services started.
> >
> > So I suggest this option is not default. We can add this option
> > (action="JVM termination") as pre-configured for ignite.sh/bat since we
> > know is it separate JVM. But I do not vote for the option, if it was the
> > default in code.
> >
> > Sincerely,
> > Dmitriy Pavlov
> >
> > On Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov <st...@gmail.com> wrote:
> >
> > > To my mind, the default action should be as severe as possible, since
> we
> > > deal with critical errors, that is, entire JVM termination. In the case
> > of
> > > some custom setup (e.g. different cluster nodes in one JVM) failure
> > > response action should be configured explicitly.
> > >
> > > 2018-03-12 12:32 GMT+03:00 Andrey Gura <ag...@apache.org>:
> > >
> > > > Igniters!
> > > >
> > > > We are working on proposal described in IEP-14 Ignite failures
> > > > handling [1] and it's time to discuss it with community (although it
> > > > was necessary to do this before).
> > > >
> > > > Most important question: what should be default behaviour in case of
> > > > failure? There are 4 actions:
> > > >
> > > > 1. Restart JVM process (it's possible only if process was started
> from
> > > > ignite.(sh|bat) script)
> > > > 2. Terminate JVM;
> > > > 3. Stop node (if there is only one node in process then process will
> > > > be also terminated);
> > > > 4. No operation.
> > > >
> > > > I believe that node should be stopped by default. But there is chance
> > > > that node will not stopped correctly.
> > > >
> > > > May be we should terminate JVM process by default. But it will kill
> > > > all nodes in the JVM process. It's especially bad behaviour in case
> > > > when nodes belong different Ignite clusters (real use case).
> > > >
> > > > May be we should restart JVM process default. This approach has the
> > > > same problems as the previous one. And additionally it could lead to
> > > > continues restarts and, therefore, continues exchanges and
> > > > rebalancing.
> > > >
> > > > Difficult choice. Could you please share your thoughts.
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > 14+Ignite+failures+handling
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >   Andrey Kuznetsov.
> > >
> >
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Denis Magda <dm...@apache.org>.

Guys,

I would make the decision depending on the type of the problematic node:

   - If it's a *server node*, then let's simply kill the process, because
   the node usually owns the whole process. I don't see a practical reason
   why a user would want to run 2 server nodes in a single process.
   - If it's a *client node*, then the best approach is to kill the node
   and not the process.
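
The two rules above could be expressed as a small lookup. This is a sketch
only: NodeType, FailureAction and NodeTypePolicy are hypothetical names and
are not Ignite API. The four actions mirror the list in the original
proposal.

```java
// Hypothetical types for illustration only; not part of Ignite.
enum NodeType { SERVER, CLIENT }

// The four actions listed in the original proposal.
enum FailureAction { RESTART_JVM, TERMINATE_JVM, STOP_NODE, NOOP }

final class NodeTypePolicy {
    private NodeTypePolicy() {}

    /**
     * Server nodes usually own the whole process, so the process is killed.
     * Client nodes are usually embedded in an application, so only the node
     * is stopped and the application keeps running.
     */
    static FailureAction defaultActionFor(NodeType type) {
        switch (type) {
            case SERVER:
                return FailureAction.TERMINATE_JVM;
            case CLIENT:
                return FailureAction.STOP_NODE;
            default:
                throw new IllegalArgumentException("Unknown node type: " + type);
        }
    }
}
```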

--
Denis

On Mon, Mar 12, 2018 at 3:04 AM, Dmitry Pavlov <dp...@gmail.com>
wrote:

> Hi Andrey, Igniters,
>
> Thank you for starting this topic, because this is really important
> decision.
>
> JVM termination in case Ignite is started within application server with
> other application will kill all services started.
>
> So I suggest this option is not default. We can add this option
> (action="JVM termination") as pre-configured for ignite.sh/bat since we
> know is it separate JVM. But I do not vote for the option, if it was the
> default in code.
>
> Sincerely,
> Dmitriy Pavlov
>
> On Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov <st...@gmail.com> wrote:
>
> > To my mind, the default action should be as severe as possible, since we
> > deal with critical errors, that is, entire JVM termination. In the case
> of
> > some custom setup (e.g. different cluster nodes in one JVM) failure
> > response action should be configured explicitly.
> >
> > 2018-03-12 12:32 GMT+03:00 Andrey Gura <ag...@apache.org>:
> >
> > > Igniters!
> > >
> > > We are working on proposal described in IEP-14 Ignite failures
> > > handling [1] and it's time to discuss it with community (although it
> > > was necessary to do this before).
> > >
> > > Most important question: what should be default behaviour in case of
> > > failure? There are 4 actions:
> > >
> > > 1. Restart JVM process (it's possible only if process was started from
> > > ignite.(sh|bat) script)
> > > 2. Terminate JVM;
> > > 3. Stop node (if there is only one node in process then process will
> > > be also terminated);
> > > 4. No operation.
> > >
> > > I believe that node should be stopped by default. But there is chance
> > > that node will not stopped correctly.
> > >
> > > May be we should terminate JVM process by default. But it will kill
> > > all nodes in the JVM process. It's especially bad behaviour in case
> > > when nodes belong different Ignite clusters (real use case).
> > >
> > > May be we should restart JVM process default. This approach has the
> > > same problems as the previous one. And additionally it could lead to
> > > continues restarts and, therefore, continues exchanges and
> > > rebalancing.
> > >
> > > Difficult choice. Could you please share your thoughts.
> > >
> > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > 14+Ignite+failures+handling
> > >
> >
> >
> >
> > --
> > Best regards,
> >   Andrey Kuznetsov.
> >
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Dmitry Pavlov <dp...@gmail.com>.

Hi Andrey, Igniters,

Thank you for starting this topic, because this is a really important
decision.

If Ignite is started within an application server alongside other
applications, JVM termination will kill all the services running there.

So I suggest that this option not be the default. We can add this option
(action="JVM termination") as pre-configured for ignite.sh/bat, since we
know it is a separate JVM. But I would not vote for this option as the
default in code.

Sincerely,
Dmitriy Pavlov

On Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov <st...@gmail.com> wrote:

> To my mind, the default action should be as severe as possible, since we
> deal with critical errors, that is, entire JVM termination. In the case of
> some custom setup (e.g. different cluster nodes in one JVM) failure
> response action should be configured explicitly.
>
> 2018-03-12 12:32 GMT+03:00 Andrey Gura <ag...@apache.org>:
>
> > Igniters!
> >
> > We are working on proposal described in IEP-14 Ignite failures
> > handling [1] and it's time to discuss it with community (although it
> > was necessary to do this before).
> >
> > Most important question: what should be default behaviour in case of
> > failure? There are 4 actions:
> >
> > 1. Restart JVM process (it's possible only if process was started from
> > ignite.(sh|bat) script)
> > 2. Terminate JVM;
> > 3. Stop node (if there is only one node in process then process will
> > be also terminated);
> > 4. No operation.
> >
> > I believe that node should be stopped by default. But there is chance
> > that node will not stopped correctly.
> >
> > May be we should terminate JVM process by default. But it will kill
> > all nodes in the JVM process. It's especially bad behaviour in case
> > when nodes belong different Ignite clusters (real use case).
> >
> > May be we should restart JVM process default. This approach has the
> > same problems as the previous one. And additionally it could lead to
> > continues restarts and, therefore, continues exchanges and
> > rebalancing.
> >
> > Difficult choice. Could you please share your thoughts.
> >
> > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > 14+Ignite+failures+handling
> >
>
>
>
> --
> Best regards,
>   Andrey Kuznetsov.
>

Re: IEP-14: Ignite failures handling (Discussion)

Posted by Andrey Kuznetsov <st...@gmail.com>.

To my mind, the default action should be as severe as possible, since we
deal with critical errors; that is, entire JVM termination. For a custom
setup (e.g. different cluster nodes in one JVM), the failure response
action should be configured explicitly.

2018-03-12 12:32 GMT+03:00 Andrey Gura <ag...@apache.org>:

> Igniters!
>
> We are working on proposal described in IEP-14 Ignite failures
> handling [1] and it's time to discuss it with community (although it
> was necessary to do this before).
>
> Most important question: what should be default behaviour in case of
> failure? There are 4 actions:
>
> 1. Restart JVM process (it's possible only if process was started from
> ignite.(sh|bat) script)
> 2. Terminate JVM;
> 3. Stop node (if there is only one node in process then process will
> be also terminated);
> 4. No operation.
>
> I believe that node should be stopped by default. But there is chance
> that node will not stopped correctly.
>
> May be we should terminate JVM process by default. But it will kill
> all nodes in the JVM process. It's especially bad behaviour in case
> when nodes belong different Ignite clusters (real use case).
>
> May be we should restart JVM process default. This approach has the
> same problems as the previous one. And additionally it could lead to
> continues restarts and, therefore, continues exchanges and
> rebalancing.
>
> Difficult choice. Could you please share your thoughts.
>
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> 14+Ignite+failures+handling
>



-- 
Best regards,
  Andrey Kuznetsov.