Apache Ignite Developers - Legacy Mail Archive

Communication exception handling

yzhdanov

Communication exception handling

Guys,

I see the following code
(org/apache/ignite/internal/processors/cache/distributed/dht/GridDhtTxPrepareFuture.java:1129):

    try {
        cctx.io().send(n, req, tx.ioPolicy());
    }
    catch (ClusterTopologyCheckedException e) {
        fut.onNodeLeft(e);
    }
    catch (IgniteCheckedException e) {
        if (!cctx.kernalContext().isStopping())
            fut.onResult(e);
    }

This means that if a node has just started its stop procedure, cache
operations may hang: once isStopping() returns true, a send failure is
dropped and the future is never completed. If cache.put() is called from a
job while the node is stopping gracefully, the stop process hangs with 100%
probability.
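
To illustrate the effect, here is a minimal, self-contained sketch. It does
not use any Ignite classes: the stopping flag, send() and the
CompletableFuture are hypothetical stand-ins (assumptions) for
cctx.kernalContext().isStopping(), cctx.io().send(...) and the prepare
future, and the short timeout stands in for a caller that would otherwise
wait forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SwallowedErrorSketch {
    /** Stand-in for the system-wide stopping flag (assumption, not Ignite API). */
    static volatile boolean stopping;

    /** Hypothetical send that always fails, as it might once communication shuts down. */
    static void send(String msg) throws Exception {
        throw new Exception("Failed to send: " + msg);
    }

    /** Mirrors the catch block above: the error reaches the future only if the node is not stopping. */
    static CompletableFuture<String> prepare(String msg) {
        CompletableFuture<String> fut = new CompletableFuture<>();

        try {
            send(msg);
            fut.complete("sent");
        }
        catch (Exception e) {
            if (!stopping)
                fut.completeExceptionally(e);
            // Otherwise the exception is dropped and nothing ever completes the future.
        }

        return fut;
    }

    /** Returns true if a caller waiting on the future is still blocked after the timeout. */
    static boolean hangs(CompletableFuture<String> fut) {
        try {
            fut.get(200, TimeUnit.MILLISECONDS);
            return false;
        }
        catch (TimeoutException e) {
            return true;
        }
        catch (ExecutionException e) {
            return false; // Completed with an error, so the caller is released.
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        stopping = false;
        System.out.println("hangs while running:  " + hangs(prepare("req")));

        stopping = true;
        System.out.println("hangs while stopping: " + hangs(prepare("req")));
    }
}
```

With the flag cleared, the waiter is released with an error. With the flag
set, the waiter stays blocked until the timeout expires.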

This issue does not affect failure detection or node crashes, since those
cases are handled by separate logic.

I changed the Communication SPI to use its own internal stopping flag instead
of the system-wide one, and this seems to fix the hang on graceful stop.
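
The PR itself is not reproduced here. As a rough sketch of the general idea
only, the following shows the difference between checking a node-wide flag
and a component-local one. All names (nodeStopping, Comm, trySend) are
hypothetical and are not taken from the Ignite code base.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class LocalStopFlagSketch {
    /** Stand-in for the node-wide flag, raised as soon as a stop is requested. */
    static final AtomicBoolean nodeStopping = new AtomicBoolean();

    /** Hypothetical communication component with its own lifecycle flag. */
    static final class Comm {
        /** Raised only when this component itself is stopped. */
        private final AtomicBoolean stopped = new AtomicBoolean();

        /** Selects which flag is consulted before sending. */
        private final boolean useLocalFlag;

        Comm(boolean useLocalFlag) {
            this.useLocalFlag = useLocalFlag;
        }

        void stop() {
            stopped.set(true);
        }

        /** Returns true if the message is accepted for sending. */
        boolean trySend(String msg) {
            boolean refuse = useLocalFlag ? stopped.get() : nodeStopping.get();

            return !refuse;
        }
    }

    public static void main(String[] args) {
        // Graceful stop has been requested, but the component is not stopped yet.
        nodeStopping.set(true);

        System.out.println("node-wide flag accepts send: " + new Comm(false).trySend("req"));
        System.out.println("local flag accepts send:     " + new Comm(true).trySend("req"));
    }
}
```

With the local flag, messages are still accepted between the moment the stop
is requested and the moment the component itself is stopped, which is the
window in which running jobs need their cache operations to make progress.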

Semyon, could you please check whether this may cause any other issues of
this kind?

My changes are here - https://github.com/apache/ignite/pull/278

--Yakov

Semyon Boikov

Re: Communication exception handling

Yakov,

When a node is stopped, all cache futures are completed with an error. Where
did you see the hang?

On Sat, Nov 28, 2015 at 3:37 PM, Yakov Zhdanov <[hidden email]> wrote:

> [...]

yzhdanov

Re: Communication exception handling

The cache processor has not received the stop signal yet, because the
stopping thread is stuck in the job processor waiting for all jobs to finish.
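
A simplified sketch of that ordering problem, using plain JDK classes rather
than Ignite ones. The two-step stop sequence, the job thread and the
sendErrorReachesFuture switch are assumptions made for illustration, and the
300 ms timeout stands in for an unbounded wait.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class StopOrderSketch {
    /**
     * Runs a simplified stop sequence: first wait for the job, then stop the cache component.
     * Returns true if the job finished in time, false if the stop sequence got stuck on it.
     */
    static boolean stopNode(boolean sendErrorReachesFuture) {
        CompletableFuture<Void> cacheFut = new CompletableFuture<>();
        CountDownLatch jobDone = new CountDownLatch(1);

        // The job performs a cache operation and blocks until its future completes.
        Thread job = new Thread(() -> {
            if (sendErrorReachesFuture)
                cacheFut.completeExceptionally(new RuntimeException("send failed"));

            try {
                cacheFut.get();
            }
            catch (Exception ignored) {
                // The operation failed, but the job is released.
            }

            jobDone.countDown();
        });

        job.setDaemon(true);
        job.start();

        try {
            // Step 1: the job processor waits for all jobs to finish.
            boolean finished = jobDone.await(300, TimeUnit.MILLISECONDS);

            // Step 2: the cache component fails pending futures. In the scenario described,
            // this step is never reached because step 1 waits without a timeout.
            cacheFut.completeExceptionally(new RuntimeException("node stopped"));

            return finished;
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("error delivered, job finished: " + stopNode(true));
        System.out.println("error swallowed, job finished: " + stopNode(false));
    }
}
```

When the send error is swallowed, the only thing that could release the job
is step 2, and step 2 cannot start until the job is released.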

--Yakov

2015-11-28 15:57 GMT+03:00 Semyon Boikov <[hidden email]>:

> [...]

Semyon Boikov

Re: Communication exception handling

The fix looks good, but it could still be risky to merge at the last minute
before the release.

On Sat, Nov 28, 2015 at 4:44 PM, Yakov Zhdanov <[hidden email]> wrote:

> [...]