Apache Ignite Developers - Legacy Mail Archive

Internal problems requiring graceful node shutdown, reboot, etc.

Dmitriy_Sorokin

Internal problems requiring graceful node shutdown, reboot, etc.

Hi, Igniters!

We have a set of internal problems that require a graceful node shutdown
or some other configured reaction (see the discussion thread
http://apache-ignite-developers.2346864.n4.nabble.com/Ignite-Enhancement-Proposal-7-Internal-problems-detection-td24460.html
):
- IgniteOutOfMemoryException -
https://issues.apache.org/jira/browse/IGNITE-6892
- Persistence errors - https://issues.apache.org/jira/browse/IGNITE-6891
- ExchangeWorker exits with error -
https://issues.apache.org/jira/browse/IGNITE-6890

First, I propose to redefine the 3rd problem as "System worker exits while
the node is still running (the node stopping process has not been started)",
because we have at least 5 worker classes that must keep running for the
node to work.

These workers are:
- partition-exchanger (ExchangeWorker)
- disco-event-worker
- nio-acceptor
- grid-nio-worker-tcp-comm-*
- grid-timeout-worker
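
The redefined problem ("a system worker exits while the node is not stopping") can be illustrated with a minimal sketch. It is not Ignite code: WatchedWorker, the nodeStopping flag and the onUnexpectedExit callback are hypothetical names chosen for this example, and only the worker name is taken from the list above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BiConsumer;

// Hypothetical sketch: WatchedWorker is not an Ignite class.
final class WatchedWorker implements Runnable {
    private final String name;
    private final Runnable body;
    private final AtomicBoolean nodeStopping;
    private final BiConsumer<String, Throwable> onUnexpectedExit;

    WatchedWorker(String name, Runnable body, AtomicBoolean nodeStopping,
        BiConsumer<String, Throwable> onUnexpectedExit) {
        this.name = name;
        this.body = body;
        this.nodeStopping = nodeStopping;
        this.onUnexpectedExit = onUnexpectedExit;
    }

    @Override public void run() {
        Throwable err = null;

        try {
            body.run();
        }
        catch (Throwable t) {
            err = t; // Remember why the worker died; null means it simply returned.
        }

        // Any exit counts as a problem unless the node is already stopping.
        if (!nodeStopping.get())
            onUnexpectedExit.accept(name, err);
    }
}

final class WorkerExitDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean stopping = new AtomicBoolean(false);

        Thread t = new Thread(new WatchedWorker(
            "partition-exchanger",
            () -> { throw new IllegalStateException("exchange failed"); },
            stopping,
            (name, err) -> System.out.println("Unexpected exit of " + name + ": " + err)));

        t.start();
        t.join();
    }
}
```

In this sketch the callback is where a configured reaction would be triggered. A worker that exits after the stopping flag has been set is treated as part of a normal shutdown and is not reported.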

Second, I propose to use FailureProcessingPolicy (already implemented in
the scope of task IGNITE-6890) to define the reaction to the 1st and 2nd
problems too. This policy can be configured in IgniteConfiguration,
similarly to SegmentationPolicy.

Opinions?

dmagda

Re: Internal problems requiring graceful node shutdown, reboot, etc.

Hi Dmitriy,

I’m totally for the FailureProcessingPolicy addition to IgniteConfiguration.

Apart from this, may I ask you to create the corresponding documentation tickets for the 2.4 release under the “documentation” component? Only for the improvements that are getting into the next release. You can aggregate them if it helps. Feel free to assign the ticket to me right away.

—
Denis

> On Nov 30, 2017, at 10:31 AM, Дмитрий Сорокин <[hidden email]> wrote:

Vladimir Ozerov

Re: Internal problems requiring graceful node shutdown, reboot, etc.

Hi Dmitry,

I do not think it is a good idea to mix failures of different threads into a
single event type.
- Practice shows that the most common source of problems is the exchange thread
- if the disco worker has died, the node will be excluded from the topology safely
- "nio-acceptor" can be spawned from multiple places where GridNioServer is
started, and not all of them are critical
- "grid-nio-worker-tcp-comm" is an internal thread which doesn't do any
complex processing, so the risk of its crash is minimal

We could track most of them, but the death of different threads may call for
different actions on the user side. So I propose to start with the exchange
thread only for now.

Another important point is that FailureProcessingPolicy should get enough
information about what happened in order to decide how to react. E.g., as I
explained earlier, IgniteOutOfMemoryException *is not a critical error*.
Nasty, but not deadly. And the node should not be stopped blindly in response
to this event.

Vladimir.

On Fri, Dec 1, 2017 at 3:50 AM, Denis Magda <[hidden email]> wrote:

Anton Vinogradov

Re: Internal problems requiring graceful node shutdown, reboot, etc.

Dmitry,

It seems we have found that it's impossible to specify one action for all cases,
but it's a good idea to let the user decide what to do.
We should make something like

interface IgniteFailureHandler {
    IgniteFailureAction onFailure(IgniteFailureCause cause);
}

public enum IgniteFailureAction {
    RESTART_JVM,
    STOP,
    NOOP;
}

and add the ability to set it in IgniteConfiguration.
Also, we should provide a default implementation of IgniteFailureHandler, which
should be enabled by default and can be replaced by the user's code.
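
A self-contained sketch of how this proposal could fit together is given below. Only IgniteFailureHandler, IgniteFailureAction and the name IgniteFailureCause come from the message above. The contents of IgniteFailureCause, the FailureKind enum, DefaultFailureHandler and NodeConfigSketch (a stand-in for IgniteConfiguration) are assumptions made for illustration. The default handler's choices are also an assumption, loosely following Vladimir's remark that an out-of-memory condition should not stop the node blindly.

```java
import java.util.Objects;

// Names from the proposal in this thread.
enum IgniteFailureAction { RESTART_JVM, STOP, NOOP }

interface IgniteFailureHandler {
    IgniteFailureAction onFailure(IgniteFailureCause cause);
}

// Hypothetical: the thread does not define what a cause carries.
enum FailureKind { OUT_OF_MEMORY, PERSISTENCE_ERROR, SYSTEM_WORKER_EXIT }

final class IgniteFailureCause {
    final FailureKind kind;
    final Throwable error; // May be null, e.g. when a worker returned without an exception.

    IgniteFailureCause(FailureKind kind, Throwable error) {
        this.kind = Objects.requireNonNull(kind);
        this.error = error;
    }
}

// Hypothetical default: ignore out-of-memory, stop the node on everything else.
final class DefaultFailureHandler implements IgniteFailureHandler {
    @Override public IgniteFailureAction onFailure(IgniteFailureCause cause) {
        switch (cause.kind) {
            case OUT_OF_MEMORY:
                return IgniteFailureAction.NOOP;
            default:
                return IgniteFailureAction.STOP;
        }
    }
}

// Hypothetical stand-in for IgniteConfiguration, holding the handler.
final class NodeConfigSketch {
    private IgniteFailureHandler failureHandler = new DefaultFailureHandler();

    NodeConfigSketch setFailureHandler(IgniteFailureHandler hnd) {
        this.failureHandler = Objects.requireNonNull(hnd);
        return this;
    }

    IgniteFailureHandler getFailureHandler() {
        return failureHandler;
    }
}

final class FailureHandlerDemo {
    public static void main(String[] args) {
        NodeConfigSketch cfg = new NodeConfigSketch();
        IgniteFailureCause oom = new IgniteFailureCause(
            FailureKind.OUT_OF_MEMORY, new RuntimeException("page memory exhausted"));

        System.out.println(cfg.getFailureHandler().onFailure(oom)); // prints NOOP

        // User code replaces the default handler.
        cfg.setFailureHandler(cause -> IgniteFailureAction.RESTART_JVM);
        System.out.println(cfg.getFailureHandler().onFailure(oom)); // prints RESTART_JVM
    }
}
```

Passing a cause object rather than a bare event lets the handler pick a different action per failure, which is the point raised earlier in the thread.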

On Fri, Dec 1, 2017 at 4:27 PM, Vladimir Ozerov <[hidden email]>
wrote:

Pavel Kovalenko

Re: Internal problems requiring graceful node shutdown, reboot, etc.

Igniters,

I have implemented handling of critical persistence I/O errors with a temporary
callback which stops the node. After the PR
<https://github.com/apache/ignite/pull/3394> is merged, the callback should be
replaced with the generic solution proposed by Anton.
I have also added tests checking that the node recovers successfully after critical
failures during cache initialization, checkpoint writing and WAL writing.
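
The code of the PR is not shown in this thread. As a rough illustration of the idea only, a storage operation can be wrapped so that an I/O failure triggers a node-stopping callback. CriticalIoGuard, IoOperation and the stopNode callback are hypothetical names and are not taken from the PR.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical: a storage operation that may fail with an I/O error.
interface IoOperation {
    void run() throws IOException;
}

// Hypothetical guard around critical persistence operations.
final class CriticalIoGuard {
    private final Runnable stopNode; // Temporary callback; a generic handler could replace it later.

    CriticalIoGuard(Runnable stopNode) {
        this.stopNode = stopNode;
    }

    // Runs the operation; on an I/O error invokes the callback and rethrows with context.
    void execute(String what, IoOperation op) throws IOException {
        try {
            op.run();
        }
        catch (IOException e) {
            stopNode.run();

            throw new IOException("Critical persistence failure during " + what, e);
        }
    }
}

final class CriticalIoDemo {
    public static void main(String[] args) {
        AtomicBoolean stopped = new AtomicBoolean(false);
        CriticalIoGuard guard = new CriticalIoGuard(() -> stopped.set(true));

        try {
            guard.execute("WAL write", () -> { throw new IOException("disk full"); });
        }
        catch (IOException e) {
            System.out.println(e.getMessage() + ", node stopped: " + stopped.get());
        }
    }
}
```

Keeping the reaction behind a single callback means that swapping in a generic failure handler later only changes what is passed to the guard, not the guarded call sites.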
