Posted to dev@ignite.apache.org by Yakov Zhdanov <yz...@apache.org> on 2015/11/28 13:37:54 UTC

Communication exception handling

Guys,

I see the following code
(org/apache/ignite/internal/processors/cache/distributed/dht/GridDhtTxPrepareFuture.java:1129):

                    try {
                        cctx.io().send(n, req, tx.ioPolicy());
                    }
                    catch (ClusterTopologyCheckedException e) {
                        fut.onNodeLeft(e);
                    }
                    catch (IgniteCheckedException e) {
                        if (!cctx.kernalContext().isStopping())
                            fut.onResult(e);
                    }


This means that if a node has just started its stop procedure, the send
failure is ignored and the future is never completed, so all cache
operations may potentially hang. If cache.put() is called from a job and
the node is stopping gracefully, the stop process hangs with 100%
probability.
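
To illustrate the pattern outside of Ignite, here is a minimal sketch. All
names in it (SwallowedSendFailure, nodeStopping, send, prepare) are
hypothetical stand-ins, not Ignite APIs, and for simplicity a successful
send completes the future right away, which the real prepare future does
not do.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

class SwallowedSendFailure {
    /** Hypothetical stand-in for the node-wide "stopping" flag. */
    static final AtomicBoolean nodeStopping = new AtomicBoolean();

    /** Hypothetical send that starts failing once shutdown has begun. */
    static void send(String msg) throws Exception {
        if (nodeStopping.get())
            throw new Exception("Failed to send (node is stopping): " + msg);
    }

    /** Same shape as the catch block above: the error is dropped while stopping. */
    static CompletableFuture<Void> prepare(String msg) {
        CompletableFuture<Void> fut = new CompletableFuture<>();

        try {
            send(msg);

            fut.complete(null); // Simplification: treat a successful send as completion.
        }
        catch (Exception e) {
            if (!nodeStopping.get())
                fut.completeExceptionally(e);
            // Otherwise nothing completes the future, so any waiter blocks.
        }

        return fut;
    }

    public static void main(String[] args) {
        nodeStopping.set(true);

        CompletableFuture<Void> fut = prepare("prepare-request");

        System.out.println("done=" + fut.isDone()); // prints "done=false"
    }
}
```

Anything that later calls get() on such a future waits until some other
component completes it.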

This issue does not affect failure detection or node crash cases, since
those are handled by separate logic.

I changed the Communication SPI to use its internal stopping flag instead
of the system-wide one, and this seems to fix the issue with graceful stop.
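
Roughly, the idea can be sketched as below. This is not the code from the
pull request: the classes LocalStoppingFlag and Communication and their
methods are hypothetical, and the sketch assumes the point of the change
is that sending keeps working until the component itself is stopped.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class LocalStoppingFlag {
    /** Hypothetical node-wide flag, set as soon as shutdown begins. */
    static final AtomicBoolean nodeStopping = new AtomicBoolean();

    /** Hypothetical communication component with its own lifecycle flag. */
    static class Communication {
        private final AtomicBoolean stopping = new AtomicBoolean();

        /** Called by the stop sequence only when this component's turn comes. */
        void stop() {
            stopping.set(true);
        }

        /** Rejects sends only after this component itself has been stopped. */
        boolean send(String msg) {
            return !stopping.get();
        }
    }

    public static void main(String[] args) {
        Communication comm = new Communication();

        nodeStopping.set(true); // Shutdown has begun elsewhere on the node.

        System.out.println(comm.send("a")); // prints "true"

        comm.stop();

        System.out.println(comm.send("b")); // prints "false"
    }
}
```

With a local flag, operations that are still in flight when shutdown
begins can keep exchanging messages until the component is actually
stopped.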

Semyon, could you please check whether this may cause any other issues of
this kind?

My changes are here - https://github.com/apache/ignite/pull/278

--Yakov

Re: Communication exception handling

Posted by Semyon Boikov <sb...@gridgain.com>.

The fix looks good, but it could still be dangerous to merge at the last
minute before the release.

On Sat, Nov 28, 2015 at 4:44 PM, Yakov Zhdanov <yz...@apache.org> wrote:

> Cache processor has not received stop signal since stopping thread is
> trapped in job processor waiting for all jobs to finish.
>
> --Yakov
>
> 2015-11-28 15:57 GMT+03:00 Semyon Boikov <sb...@gridgain.com>:
>
> > Yakov,
> >
> > When node is stopped all cache futures are completed with error, where
> > did you see hang?

Re: Communication exception handling

Posted by Yakov Zhdanov <yz...@apache.org>.

The cache processor has not received the stop signal, since the stopping
thread is trapped in the job processor, waiting for all jobs to finish.
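
The ordering problem can be modeled with a small sketch. The names are
hypothetical and nothing here is Ignite code: an executor stands in for
the job processor, and a plain future stands in for the cache operation
that the job is waiting on. A timeout replaces the indefinite wait so
that the example terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class StopOrderingHang {
    /** Returns true if the job finished before the cache stop step ran. */
    static boolean jobsFinishBeforeCacheStop(long timeoutMs) {
        // In this model, only the cache stop step ever completes this future.
        CompletableFuture<Void> cacheFut = new CompletableFuture<>();

        ExecutorService jobs = Executors.newSingleThreadExecutor();

        // The job blocks on a cache operation whose send failure was swallowed.
        jobs.submit(() -> {
            try {
                cacheFut.get();
            }
            catch (Exception ignored) {
                // The future failed or the thread was interrupted; the job ends.
            }
        });

        // Stop step 1: wait for all jobs to finish (bounded here by a timeout).
        jobs.shutdown();

        boolean finished;

        try {
            finished = jobs.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            finished = false;
        }

        // Stop step 2: the cache component fails its pending futures.
        cacheFut.completeExceptionally(new IllegalStateException("Node is stopping"));

        return finished;
    }

    public static void main(String[] args) {
        // prints "jobs finished before cache stop: false"
        System.out.println("jobs finished before cache stop: " + jobsFinishBeforeCacheStop(200));
    }
}
```

Without the timeout, step 1 would wait forever, because the only thing
that can release the job is step 2.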

--Yakov

2015-11-28 15:57 GMT+03:00 Semyon Boikov <sb...@gridgain.com>:

> Yakov,
>
> When node is stopped all cache futures are completed with error, where did
> you see hang?

Re: Communication exception handling

Posted by Semyon Boikov <sb...@gridgain.com>.

Yakov,

When a node is stopped, all cache futures are completed with an error.
Where did you see the hang?

