Hi, Igniters!
We have a set of internal problems that require a graceful node shutdown or some other configured reaction (see the discussion thread http://apache-ignite-developers.2346864.n4.nabble.com/Ignite-Enhancement-Proposal-7-Internal-problems-detection-td24460.html ):
- IgniteOutOfMemoryException - https://issues.apache.org/jira/browse/IGNITE-6892
- Persistence errors - https://issues.apache.org/jira/browse/IGNITE-6891
- ExchangeWorker exits with error - https://issues.apache.org/jira/browse/IGNITE-6890

First, I propose to reconsider the 3rd problem as "System worker exit while the node is still running (the node stopping process has not been started)", because we have at least 5 worker classes that must keep running for the node to work.

These workers are:
- partition-exchanger (ExchangeWorker)
- disco-event-worker
- nio-acceptor
- grid-nio-worker-tcp-comm-*
- grid-timeout-worker

Second, I propose to use FailureProcessingPolicy (already implemented in the scope of task IGNITE-6890) to define the reaction to the 1st and 2nd problems too. This policy can be configured in IgniteConfiguration, similarly to SegmentationPolicy.

Opinions?
Hi Dmitriy,
I’m totally for adding FailureProcessingPolicy to IgniteConfiguration.

Apart from this, may I ask you to create the corresponding documentation tickets for the 2.4 release under the “documentation” component? Only for the improvements that are getting into the next release. You can aggregate them if it helps. Feel free to assign the tickets to me right away.

--
Denis

On Nov 30, 2017, at 10:31 AM, Дмитрий Сорокин <[hidden email]> wrote:
> Hi, Igniters!
> [...]
Hi Dmitry,
I do not think it is a good idea to mix failures of different threads into a single event type.
- Practice shows that the most common source of problems is the exchange thread
- If the disco worker has died, the node will be excluded from the topology safely
- "nio-acceptor" can be spawned from multiple places where GridNioServer is started, and not all of them are critical
- "grid-nio-worker-tcp-comm" is an internal thread which doesn't do any complex processing, so the risk of its crash is minimal

We could track most of them, but the death of different threads may call for different actions on the user side. So I propose to start with the exchange thread only for now.

Another important point is that FailureProcessingPolicy should get enough information about what happened in order to decide how to react. E.g., as I explained earlier, IgniteOutOfMemoryException *is not a critical error*. Nasty, but not deadly. And the node should not be stopped blindly in response to this event.

Vladimir.

On Fri, Dec 1, 2017 at 3:50 AM, Denis Magda <[hidden email]> wrote:
> Hi Dmitriy,
> [...]
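Vladimir's two points (different threads may call for different reactions, and the policy needs to know what happened) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `FailureKind`, `FailureInfo` and `FailureClassifier` are hypothetical names, not Ignite classes, and the choice to treat persistence errors as critical is only one possible reading of the thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical failure categories, mirroring the three problems listed in the thread. */
enum FailureKind { OUT_OF_MEMORY, PERSISTENCE_ERROR, SYSTEM_WORKER_EXIT }

/** Hypothetical description of a detected failure, carrying enough detail to choose a reaction. */
final class FailureInfo {
    final FailureKind kind;
    final String workerName; // null when the failure is not tied to a worker thread
    final Throwable error;

    FailureInfo(FailureKind kind, String workerName, Throwable error) {
        this.kind = kind;
        this.workerName = workerName;
        this.error = error;
    }
}

final class FailureClassifier {
    /** Workers treated as critical in this sketch: only the exchange thread, as proposed above. */
    private static final Set<String> CRITICAL_WORKERS =
        new HashSet<>(Arrays.asList("partition-exchanger"));

    /** Returns true when this sketch considers the failure deadly for the node. */
    static boolean isCritical(FailureInfo failure) {
        switch (failure.kind) {
            case PERSISTENCE_ERROR:
                // Assumption: persistence errors are treated as critical here.
                return true;
            case SYSTEM_WORKER_EXIT:
                return failure.workerName != null && CRITICAL_WORKERS.contains(failure.workerName);
            case OUT_OF_MEMORY:
                // "Nasty, but not deadly": not critical by itself.
                return false;
            default:
                return false;
        }
    }
}
```

With this shape, adding another tracked worker later only means extending `CRITICAL_WORKERS`; the event type itself does not have to change.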
Dmitry,
It seems we have found that it is impossible to specify one action for all cases, but it is a good idea to let the user decide what to do. We should add something like

    interface IgniteFailureHandler {
        IgniteFailureAction onFailure(IgniteFailureCause cause);
    }

    public enum IgniteFailureAction {
        RESTART_JVM, STOP, NOOP;
    }

and the ability to set it in IgniteConfiguration. Also, we should provide a default implementation of IgniteFailureHandler, which should be enabled by default and can be replaced by the user's code.

On Fri, Dec 1, 2017 at 4:27 PM, Vladimir Ozerov <[hidden email]> wrote:
> Hi Dmitry,
> [...]
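The handler proposed in this message can be made compilable with a few assumptions. `IgniteFailureHandler` and `IgniteFailureAction` follow the message; the contents of `IgniteFailureCause`, the `DefaultFailureHandler` class and the `SketchConfiguration` stand-in for IgniteConfiguration are hypothetical and only show how a default handler might be replaced by user code. The default behaviour chosen here (NOOP on out-of-memory, STOP otherwise) is also an assumption, not something the thread settled.

```java
import java.util.Objects;

/** Hypothetical cause descriptor; the thread names the type but does not define its contents. */
final class IgniteFailureCause {
    enum Type { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_EXIT }

    final Type type;
    final Throwable error;

    IgniteFailureCause(Type type, Throwable error) {
        this.type = type;
        this.error = error;
    }
}

/** Actions listed in the proposal. */
enum IgniteFailureAction { RESTART_JVM, STOP, NOOP }

/** Handler interface as written in the proposal. */
interface IgniteFailureHandler {
    IgniteFailureAction onFailure(IgniteFailureCause cause);
}

/** Hypothetical default handler: ignores out-of-memory, stops the node on anything else. */
final class DefaultFailureHandler implements IgniteFailureHandler {
    @Override public IgniteFailureAction onFailure(IgniteFailureCause cause) {
        return cause.type == IgniteFailureCause.Type.OUT_OF_MEMORY
            ? IgniteFailureAction.NOOP
            : IgniteFailureAction.STOP;
    }
}

/** Stand-in for IgniteConfiguration; only the handler property is modelled. */
final class SketchConfiguration {
    private IgniteFailureHandler failureHandler = new DefaultFailureHandler();

    SketchConfiguration setFailureHandler(IgniteFailureHandler handler) {
        this.failureHandler = Objects.requireNonNull(handler, "handler");
        return this;
    }

    IgniteFailureHandler getFailureHandler() {
        return failureHandler;
    }
}
```

Because the interface has a single method, user code could replace the default with a lambda, for example `new SketchConfiguration().setFailureHandler(cause -> IgniteFailureAction.RESTART_JVM)`.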
Igniters,
I have implemented handling of critical persistence I/O errors with a temporary callback that stops the node. After the PR <https://github.com/apache/ignite/pull/3394> is merged, the callback should be replaced with the generic solution proposed by Anton.

I have also added tests checking that the node recovers successfully after critical failures during cache initialization, checkpoint writing and WAL writing.