Guys,
I see the following code (org/apache/ignite/internal/processors/cache/distributed/dht/GridDhtTxPrepareFuture.java:1129):

    try {
        cctx.io().send(n, req, tx.ioPolicy());
    }
    catch (ClusterTopologyCheckedException e) {
        fut.onNodeLeft(e);
    }
    catch (IgniteCheckedException e) {
        if (!cctx.kernalContext().isStopping())
            fut.onResult(e);
    }

This means that if a node has just started its stop procedure, all cache operations may potentially hang. If cache.put() is called from a job and the node is stopping gracefully, the stop process hangs with 100% probability.

This issue does not affect failure detection or node crash cases, since those are handled by separate logic.

I fixed the Communication SPI to use its internal stopping flag instead of the system-wide one, and this seems to fix the issue with graceful stop.

Semyon, can you please check whether this may cause any other issue of this kind?

My changes are here - https://github.com/apache/ignite/pull/278

--Yakov
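To make the failure mode in the message above concrete, here is a minimal sketch using only the JDK. It is not Ignite code: `StopHangSketch`, `send`, `prepare`, and both boolean flags are hypothetical stand-ins, and it assumes the send is rejected because the communication layer consults the node-wide stopping flag. Under that assumption, the error is swallowed and the future is never completed, whereas a communication layer that checks only its own flag keeps sending while jobs drain.

```java
import java.util.concurrent.CompletableFuture;

/** Illustrative stand-in only; none of these names are Ignite APIs. */
class StopHangSketch {
    /** Hypothetical send: rejected when the flag consulted by the communication layer is set. */
    static void send(boolean rejectFlag) {
        if (rejectFlag)
            throw new IllegalStateException("Failed to send message (stopping).");
    }

    /**
     * Mirrors the shape of the quoted catch block: a send failure is reported
     * to the future only when the node-wide stopping flag is NOT set.
     */
    static CompletableFuture<String> prepare(boolean flagSeenBySend, boolean nodeStopping) {
        CompletableFuture<String> fut = new CompletableFuture<>();

        try {
            send(flagSeenBySend);

            fut.complete("sent"); // Simplification: a successful send completes the future.
        }
        catch (IllegalStateException e) {
            if (!nodeStopping)
                fut.completeExceptionally(e);
            // Otherwise the error is dropped and nothing ever completes the future.
        }

        return fut;
    }

    public static void main(String[] args) {
        boolean nodeStopping = true;  // Graceful stop has begun.
        boolean spiStopping = false;  // Communication layer itself is still running.

        // Send consults the node-wide flag: it fails, and the failure is swallowed.
        System.out.println("node-wide flag, future done: "
            + prepare(nodeStopping, nodeStopping).isDone());

        // Send consults its own flag: it still succeeds while jobs drain.
        System.out.println("SPI-local flag, future done: "
            + prepare(spiStopping, nodeStopping).isDone());
    }
}
```

Any caller blocked on the first future would wait indefinitely, which is the hang described for cache.put() called from a job.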
Yakov,
When a node is stopped, all cache futures are completed with an error. Where did you see the hang?

On Sat, Nov 28, 2015 at 3:37 PM, Yakov Zhdanov <[hidden email]> wrote:
> [...]
The cache processor has not received the stop signal, since the stopping thread is trapped in the job processor, waiting for all jobs to finish.

--Yakov

2015-11-28 15:57 GMT+03:00 Semyon Boikov <[hidden email]>:
> [...]
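The circular wait described in this reply can be sketched with plain JDK threads. This is an illustration under stated assumptions, not Ignite's stop sequence: `StopOrderSketch` and `stopGetsPastJobs` are hypothetical names, the two "steps" are a simplification of the ordering described above, and the bounded `waitMs` exists only so the sketch terminates (the wait in the thread is described as not returning at all).

```java
import java.util.concurrent.CompletableFuture;

/** Illustrative stand-in only; none of these names are Ignite APIs. */
class StopOrderSketch {
    /**
     * Returns true if the simulated stop thread gets past the job-wait step.
     * Pass a positive waitMs: Thread.join(0) would wait forever.
     */
    static boolean stopGetsPastJobs(boolean futureCompletesOnItsOwn, long waitMs) {
        CompletableFuture<Void> cacheFut = new CompletableFuture<>();

        // The job blocks on a cache operation, like cache.put() called from a job.
        Thread job = new Thread(() -> {
            try {
                cacheFut.get();
            }
            catch (Exception ignored) {
                // Released with an error: the job can finish.
            }
        });

        job.setDaemon(true);
        job.start();

        if (futureCompletesOnItsOwn)
            cacheFut.complete(null);

        // Step 1 of the simulated stop: wait for running jobs.
        try {
            job.join(waitMs);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        boolean jobsFinished = !job.isAlive();

        // Step 2: only now would pending cache futures be failed with an error.
        cacheFut.completeExceptionally(new IllegalStateException("Node is stopping."));

        return jobsFinished;
    }

    public static void main(String[] args) {
        System.out.println("future never completed, stop proceeds: " + stopGetsPastJobs(false, 200));
        System.out.println("future completed, stop proceeds: " + stopGetsPastJobs(true, 200));
    }
}
```

In the sketch, step 2 is what would release the job, but it is reached only after step 1 gives up waiting, which is why a swallowed send error turns into a stuck stop.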
The fix looks good, but it can still be dangerous to merge at the last minute before the release.

On Sat, Nov 28, 2015 at 4:44 PM, Yakov Zhdanov <[hidden email]> wrote:
> [...]