Posted to dev@ignite.apache.org by Nikolay Izhikov <ni...@apache.org> on 2018/12/27 08:56:55 UTC

System Worker Failure Handler on local laptop

Hello, Igniters.

I ran into an issue with the critical system worker failure handler.
I just ran `IgniteDataFrameSuite` and it terminates on a random test.
My laptop doesn't have bleeding-edge hardware, so tests can take a
significant amount of time.
It looks like our watchdog is too aggressive for a development environment.

Can you please help me? What should I do to configure or turn off the
watchdog?
Should we relax it a little bit, at least for a test environment?

The log contains the following error message:

```
[2018-12-27 11:40:23,597][ERROR][exchange-worker-#5547%grid-2%][root]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class
o.a.i.IgniteCheckedException: Node is stopping: grid-2]]
class org.apache.ignite.IgniteCheckedException: Node is stopping: grid-2
```

Re: System Worker Failure Handler on local laptop

Posted by Andrey Gura <ag...@apache.org>.
Guys,

There is no problem with blocked thread monitoring. Please look at the
error message: "failureCtx=FailureContext
[type=SYSTEM_WORKER_TERMINATION, err=class
o.a.i.IgniteCheckedException: Node is stopping: grid-2]]". Some
critical worker was terminated unexpectedly, so the problem isn't
related to any timeouts. It's a bug that should be investigated.
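
As a rough illustration of this way of reading the message, here is a small
sketch in plain Java. It is a hypothetical helper, not part of Ignite: the
class name, method names and regular expressions are assumptions made for
this example only. It extracts the ignored failure types and the reported
failure type from the log text quoted in this thread and checks whether the
reported type is one of the ignored ones.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an Ignite class): reads failure types out of a log message. */
class FailureLogInspector {
    /** The log message from this thread, joined into a single line. */
    static final String SAMPLE =
        "Critical system error detected. Will be handled accordingly to configured "
        + "handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, "
        + "super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet "
        + "[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], "
        + "failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class "
        + "o.a.i.IgniteCheckedException: Node is stopping: grid-2]]";

    /** Matches "ignoredFailureTypes=<SetClass> [A, B, ...]" and captures the list. */
    private static final Pattern IGNORED =
        Pattern.compile("ignoredFailureTypes=\\w+\\s*\\[([^\\]]*)\\]");

    /** Matches "failureCtx=FailureContext [type=X" and captures X. */
    private static final Pattern TYPE =
        Pattern.compile("failureCtx=FailureContext\\s*\\[type=(\\w+)");

    /** Returns the failure types the handler ignores, in log order (empty if absent). */
    static Set<String> ignoredTypes(String log) {
        Set<String> res = new LinkedHashSet<>();
        Matcher m = IGNORED.matcher(log);
        if (m.find()) {
            for (String s : m.group(1).split(",")) {
                String t = s.trim();
                if (!t.isEmpty())
                    res.add(t);
            }
        }
        return res;
    }

    /** Returns the reported failure type, or null if the log has no failure context. */
    static String failureType(String log) {
        Matcher m = TYPE.matcher(log);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String type = failureType(SAMPLE);
        Set<String> ignored = ignoredTypes(SAMPLE);
        System.out.println("Reported type: " + type);
        System.out.println("Ignored types: " + ignored);
        System.out.println("Reported type is ignored: " + ignored.contains(type));
    }
}
```

For the message above, the reported type is SYSTEM_WORKER_TERMINATION, which
is not among the two ignored types listed by the handler.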



On Thu, Dec 27, 2018 at 9:27 PM Denis Magda <dm...@apache.org> wrote:
>
> Folks,
>
> What are the current timeouts? We need to know the probability of failures
> in a dev environment. This affects usability.
>
> --
> Denis

Re: System Worker Failure Handler on local laptop

Posted by Denis Magda <dm...@apache.org>.
Folks,

What are the current timeouts? We need to know the probability of failures
in a dev environment. This affects usability.

--
Denis

On Thu, Dec 27, 2018 at 4:59 AM Alexey Goncharuk <al...@gmail.com>
wrote:

> Nikolay,
>
> Yes, the fix is already in master. Looks like I was wrong: in your case the
> failure handler is triggered by 'Node is stopping: grid-2'. Can you please
> share the full trace?

Re: System Worker Failure Handler on local laptop

Posted by Alexey Goncharuk <al...@gmail.com>.
Nikolay,

Yes, the fix is already in master. Looks like I was wrong: in your case the
failure handler is triggered by 'Node is stopping: grid-2'. Can you please
share the full trace?



On Thu, 27 Dec 2018 at 12:41, Nikolay Izhikov <ni...@apache.org> wrote:

> Alexey,
>
> Is the fix for this issue already in master?
> I run the tests on current master.
>
> > Should we somehow announce it on the user-list or highlight on readme.io?
>
> I don't think our users will be happy to be stuck with this behavior in
> production.
>
> Do I understand you correctly:
> if someone uses the 2.7 release and the Ignite process slows down for a few
> seconds for any reason (low-end hardware, VM pause, other processes grabbing
> the resources), then the Ignite node will be stopped?
>
> > This is the issue I mentioned in "Critical worker threads liveness
> > checking drawbacks" topic
>
> Thanks for the link, I will check it out.

Re: System Worker Failure Handler on local laptop

Posted by Nikolay Izhikov <ni...@apache.org>.
Alexey,

Is the fix for this issue already in master?
I run the tests on current master.

> Should we somehow announce it on the user-list or highlight on readme.io?

I don't think our users will be happy to be stuck with this behavior in
production.

Do I understand you correctly:
if someone uses the 2.7 release and the Ignite process slows down for a few
seconds for any reason (low-end hardware, VM pause, other processes grabbing
the resources), then the Ignite node will be stopped?

> This is the issue I mentioned in "Critical worker threads liveness
> checking drawbacks" topic

Thanks for the link, I will check it out.

On Thu, 27 Dec 2018 at 12:24, Alexey Goncharuk <al...@gmail.com> wrote:

> Hi Nikolay,
>
> This is the issue I mentioned in the "Critical worker threads liveness
> checking drawbacks" topic. I was expecting the fix to be included in
> Ignite 2.7, but it was not. To work around the issue, you should set
> DataStorageConfiguration#setCheckpointReadLockTimeout to 0.
>
> Should we somehow announce it on the user-list or highlight on readme.io?

Re: System Worker Failure Handler on local laptop

Posted by Alexey Goncharuk <al...@gmail.com>.
Hi Nikolay,

This is the issue I mentioned in the "Critical worker threads liveness
checking drawbacks" topic. I was expecting the fix to be included in
Ignite 2.7, but it was not. To work around the issue, you should set
DataStorageConfiguration#setCheckpointReadLockTimeout to 0.
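
A minimal sketch of that workaround, assuming ignite-core 2.7 or later is on
the classpath. Only `setCheckpointReadLockTimeout` comes from the paragraph
above; the class name and the rest of the node startup are illustrative
assumptions. Note that a later reply in this thread attributes the reported
failure to a different cause, so this setting may not change that particular
behavior.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

/** Illustrative only: starts a node with the checkpoint read lock timeout set to 0. */
class CheckpointReadLockTimeoutWorkaround {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();

        // The workaround named in this thread: set the timeout to 0.
        storageCfg.setCheckpointReadLockTimeout(0L);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storageCfg);

        // Start a node with this configuration and stop it on exit.
        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println("Node started: " + ignite.name());
        }
    }
}
```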

Should we somehow announce it on the user-list or highlight on readme.io?

On Thu, 27 Dec 2018 at 11:57, Nikolay Izhikov <ni...@apache.org> wrote:

> Hello, Igniters.
>
> I ran into an issue with the critical system worker failure handler.
> I just ran `IgniteDataFrameSuite` and it terminates on a random test.
> My laptop doesn't have bleeding-edge hardware, so tests can take a
> significant amount of time.
> It looks like our watchdog is too aggressive for a development environment.
>
> Can you please help me? What should I do to configure or turn off the
> watchdog?
> Should we relax it a little bit, at least for a test environment?
>
> The log contains the following error message:
>
> ```
> [2018-12-27 11:40:23,597][ERROR][exchange-worker-#5547%grid-2%][root]
> Critical system error detected. Will be handled accordingly to configured
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]],
> failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class
> o.a.i.IgniteCheckedException: Node is stopping: grid-2]]
> class org.apache.ignite.IgniteCheckedException: Node is stopping: grid-2
> ```
>