Communication exception handling

Communication exception handling

yzhdanov
Guys,

I see the following code
(org/apache/ignite/internal/processors/cache/distributed/dht/GridDhtTxPrepareFuture.java:1129):

    try {
        cctx.io().send(n, req, tx.ioPolicy());
    }
    catch (ClusterTopologyCheckedException e) {
        fut.onNodeLeft(e);
    }
    catch (IgniteCheckedException e) {
        if (!cctx.kernalContext().isStopping())
            fut.onResult(e);
    }


This means that if the node has just started its stop procedure, a send
failure is swallowed and the future is never completed, so cache operations
may hang. If cache.put() is called from a job while the node is stopping
gracefully, the stop process hangs with 100% probability.
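
To make the failure mode concrete, here is a minimal standalone sketch. It is
not Ignite code: the class, the `stopping` flag and `send()` are hypothetical
stand-ins for the system-wide stopping flag and cctx.io().send(...), and the
future is completed right after a successful send only to keep the example
short. The catch block has the same shape as the one quoted above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class SwallowedSendError {
    // Hypothetical stand-in for the system-wide "node is stopping" flag.
    static volatile boolean stopping;

    // Hypothetical stand-in for cctx.io().send(...): fails once stop has begun.
    static void send() throws Exception {
        if (stopping)
            throw new Exception("Failed to send message (node is stopping)");
    }

    // Same shape as the quoted catch block: the error reaches the future
    // only if the node is NOT stopping.
    static CompletableFuture<Void> prepare() {
        CompletableFuture<Void> fut = new CompletableFuture<>();

        try {
            send();
            fut.complete(null); // simplification: treat a successful send as the result
        }
        catch (Exception e) {
            if (!stopping)
                fut.completeExceptionally(e);
            // Otherwise the error is dropped and nothing completes 'fut' here.
        }

        return fut;
    }

    // Returns true if the future is still incomplete after a short wait.
    static boolean hangs(CompletableFuture<Void> fut) {
        try {
            fut.get(200, TimeUnit.MILLISECONDS);
            return false; // completed normally
        }
        catch (TimeoutException e) {
            return true; // nobody completed the future
        }
        catch (ExecutionException e) {
            return false; // completed with an error, so the caller is not stuck
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        stopping = true;
        System.out.println("hangs = " + hangs(prepare())); // prints "hangs = true"
    }
}
```

With `stopping` left at false the same call completes immediately, so only
callers that race with the start of the stop procedure are affected.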

This issue does not affect failure detection or node crash cases, since
those are handled by separate logic.

I changed the Communication SPI to use its internal stopping flag instead of
the system-wide one, and this seems to fix the issue with graceful stop.

Semyon, can you please check whether this may cause any other issues of this
kind?

My changes are here: https://github.com/apache/ignite/pull/278

--Yakov

Re: Communication exception handling

Semyon Boikov
Yakov,

When a node is stopped, all cache futures are completed with an error. Where
did you see the hang?


On Sat, Nov 28, 2015 at 3:37 PM, Yakov Zhdanov <[hidden email]> wrote:

> This means that if the node has just started its stop procedure, a send
> failure is swallowed and the future is never completed, so cache operations
> may hang. If cache.put() is called from a job while the node is stopping
> gracefully, the stop process hangs with 100% probability.

Re: Communication exception handling

yzhdanov
The cache processor has not received the stop signal yet, because the
stopping thread is stuck in the job processor, waiting for all jobs to finish.
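
A rough model of that ordering, as a standalone sketch rather than Ignite
code. The class and method names are made up for illustration, and the
assumption is only what is described in this thread: the stopping thread first
waits for jobs, and pending cache futures are failed in a later step. A job
blocked on a future that never received the send error keeps the first step
from finishing, so the second step is never reached.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class GracefulStopOrder {
    // Simulates a graceful stop; returns true if it finished within the timeout.
    // 'errorPropagated' says whether the send error reached the cache future.
    static boolean stop(boolean errorPropagated) {
        CompletableFuture<Void> cacheFut = new CompletableFuture<>();

        if (errorPropagated)
            cacheFut.completeExceptionally(new IllegalStateException("send failed: node is stopping"));

        // A job that called cache.put() and now waits for the operation to finish.
        ExecutorService jobs = Executors.newSingleThreadExecutor();

        jobs.submit(() -> {
            try {
                cacheFut.get();
            }
            catch (Exception ignored) {
                // Either outcome ends the job: error result or interruption.
            }
        });

        try {
            // Step 1: the "job processor" stop waits for running jobs.
            jobs.shutdown();

            boolean jobsDone = jobs.awaitTermination(300, TimeUnit.MILLISECONDS);

            // Step 2: the "cache processor" stop would fail pending futures,
            // but it is reached only after step 1 has completed.
            if (jobsDone)
                cacheFut.completeExceptionally(new IllegalStateException("node stopped"));
            else
                jobs.shutdownNow(); // demo cleanup so the JVM can exit

            return jobsDone;
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            jobs.shutdownNow();

            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("error swallowed:  stopped = " + stop(false)); // stopped = false
        System.out.println("error propagated: stopped = " + stop(true));  // stopped = true
    }
}
```

The timeout exists only so the demo terminates; a stop that waits without a
timeout would simply never return in the first case.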

--Yakov

2015-11-28 15:57 GMT+03:00 Semyon Boikov <[hidden email]>:

> Yakov,
>
> When a node is stopped, all cache futures are completed with an error. Where
> did you see the hang?

Re: Communication exception handling

Semyon Boikov
The fix looks good, but it could still be dangerous to merge at the last
minute before the release.

On Sat, Nov 28, 2015 at 4:44 PM, Yakov Zhdanov <[hidden email]> wrote:

> The cache processor has not received the stop signal yet, because the
> stopping thread is stuck in the job processor, waiting for all jobs to finish.
>
> --Yakov