Apache Ignite Developers - Legacy Mail Archive

IEP-14: Ignite failures handling (Discussion)

46 messages

agura

IEP-14: Ignite failures handling (Discussion)

Igniters!

We are working on the proposal described in IEP-14, Ignite failures
handling [1], and it's time to discuss it with the community (although
we should have done this earlier).

The most important question: what should the default behaviour be in
case of a failure? There are 4 possible actions:

1. Restart the JVM process (possible only if the process was started
from the ignite.(sh|bat) script);
2. Terminate the JVM;
3. Stop the node (if there is only one node in the process, the process
will also be terminated);
4. No operation.
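
As a rough illustration of how the four actions above differ, here is a
minimal Java sketch. The enum and method names (FailureAction,
terminatesProcess, isAvailable) are assumptions made for this example;
they are not taken from IEP-14 or from the Ignite code base.

```java
// Illustrative model of the four actions listed above.
// All names are hypothetical and are not an Ignite API.
enum FailureAction {
    RESTART_JVM,    // only possible when started from ignite.(sh|bat)
    TERMINATE_JVM,
    STOP_NODE,
    NO_OP;

    /** True if the JVM process goes down when this action runs. */
    boolean terminatesProcess(int nodesInJvm) {
        switch (this) {
            case RESTART_JVM:
            case TERMINATE_JVM:
                return true;
            case STOP_NODE:
                // Stopping the only node in the process also ends the process.
                return nodesInJvm == 1;
            default:
                return false;
        }
    }

    /** True if the action can be used for the given launch mode. */
    boolean isAvailable(boolean startedFromScript) {
        return this != RESTART_JVM || startedFromScript;
    }
}
```

With two nodes in one JVM, only STOP_NODE and NO_OP leave the second
node running, which is the trade-off discussed in the paragraphs below.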

I believe that the node should be stopped by default. But there is a
chance that the node will not stop correctly.

Maybe we should terminate the JVM process by default. But that will
kill all nodes in the JVM process, which is especially bad when the
nodes belong to different Ignite clusters (a real use case).

Maybe we should restart the JVM process by default. This approach has
the same problems as the previous one. In addition, it could lead to
continuous restarts and, therefore, continuous exchanges and
rebalancing.

It's a difficult choice. Could you please share your thoughts?

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

Andrey Kuznetsov

Re: IEP-14: Ignite failures handling (Discussion)

To my mind, the default action should be as severe as possible, since we
are dealing with critical errors; that is, the entire JVM should be
terminated. For a custom setup (e.g. nodes of different clusters in one
JVM), the failure response action should be configured explicitly.

2018-03-12 12:32 GMT+03:00 Andrey Gura <[hidden email]>:

> Igniters!
>
> We are working on proposal described in IEP-14 Ignite failures
> handling [1] and it's time to discuss it with community (although it
> was necessary to do this before).
>
> Most important question: what should be default behaviour in case of
> failure? There are 4 actions:
>
> 1. Restart JVM process (it's possible only if process was started from
> ignite.(sh|bat) script)
> 2. Terminate JVM;
> 3. Stop node (if there is only one node in process then process will
> be also terminated);
> 4. No operation.
>
> I believe that node should be stopped by default. But there is chance
> that node will not stopped correctly.
>
> May be we should terminate JVM process by default. But it will kill
> all nodes in the JVM process. It's especially bad behaviour in case
> when nodes belong different Ignite clusters (real use case).
>
> May be we should restart JVM process default. This approach has the
> same problems as the previous one. And additionally it could lead to
> continues restarts and, therefore, continues exchanges and
> rebalancing.
>
> Difficult choice. Could you please share your thoughts.
>
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
>

--
Best regards,
Andrey Kuznetsov.

Dmitriy Pavlov

Re: IEP-14: Ignite failures handling (Discussion)

Hi Andrey, Igniters,

Thank you for starting this topic, because this is a really important
decision.

If Ignite is started within an application server alongside other
applications, JVM termination will kill all of the services running there.

So I suggest that this option not be the default. We can pre-configure this
option (action="JVM termination") for ignite.sh/bat, since there we know
that it is a separate JVM. But I would not vote for this option as the
default in code.

Sincerely,
Dmitriy Pavlov

Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov <[hidden email]>:

> To my mind, the default action should be as severe as possible, since we
> deal with critical errors, that is, entire JVM termination. In the case of
> some custom setup (e.g. different cluster nodes in one JVM) failure
> response action should be configured explicitly.

dmagda

Re: IEP-14: Ignite failures handling (Discussion)

Guys,

I would make the decision depending on the type of the problematic node:

- If it's a *server node*, then let's kill the process, simply because
the node usually owns the whole process. I don't see a practical reason
why a user would want to run 2 server nodes in a single process.
- If it's a *client node*, then the best approach is to kill the node
and not the process.
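
The rule above could be sketched roughly as follows. This is only an
illustration of the suggestion; NodeTypePolicy and its members are
hypothetical names, not part of Ignite.

```java
// Sketch of a default that depends on the node type.
// Hypothetical names; not an Ignite API.
final class NodeTypePolicy {
    enum Action { KILL_PROCESS, STOP_NODE }

    /**
     * Server nodes usually own the whole process, so the process is killed.
     * Client nodes are usually embedded in an application, so only the node is stopped.
     */
    static Action defaultFor(boolean clientNode) {
        return clientNode ? Action.STOP_NODE : Action.KILL_PROCESS;
    }
}
```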

--
Denis

On Mon, Mar 12, 2018 at 3:04 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Hi Andrey, Igniters,
>
> Thank you for starting this topic, because this is really important
> decision.
>
> JVM termination in case Ignite is started within application server with
> other application will kill all services started.
>
> So I suggest this option is not default. We can add this option
> (action="JVM termination") as pre-configured for ignite.sh/bat since we
> know is it separate JVM. But I do not vote for the option, if it was the
> default in code.
>
> Sincerely,
> Dmitriy Pavlov

dsetrakyan

Re: IEP-14: Ignite failures handling (Discussion)

Denis, what is the difference between killing the process and killing the
node and the process?

D.

On Mon, Mar 12, 2018 at 12:03 PM, Denis Magda <[hidden email]> wrote:

> Guys,
>
> I would make a decision depending on a type of the problematic node:
>
> - If it's a *server node*, then let's kill the process simply because
> the node usually owns the whole process. Don't see a practical reason
> why a user wants to run 2 server nodes in a single process.
> - If it's a *client node*, then the best approach is to kill the node
> and not the process.
>
> --
> Denis

dmagda

Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy,

An Ignite client node is usually used in embedded mode. By killing the
whole process the node is running in, we're going to kill the entire
application. That doesn't sound like a good plan. That's why my suggestion
is to try to kill the node rather than the whole process.

As for the server nodes, which usually own the whole process, it's totally
fine to kill the process right away.

--
Denis

On Mon, Mar 12, 2018 at 4:12 PM, Dmitriy Setrakyan <[hidden email]>
wrote:

> Denis, what is the difference between killing the process and killing the
> node and the process?
>
> D.

dsetrakyan

Re: IEP-14: Ignite failures handling (Discussion)

On Mon, Mar 12, 2018 at 5:12 PM, Denis Magda <[hidden email]> wrote:

> Dmitriy,
>
> Ignite client node is usually used in the embedded mode. By killing the
> whole process, the node is running in, we're going to kill the entire
> application. That doesn't sound like a good plan. That's why my suggestion
> is to try to kill the node somehow instead rather than the whole process.
>

Agreed. However, if the node cannot stop gracefully, we should kill the
process anyway. This should be the default behavior. Users should be able
to turn it off as needed.

>
> As for the server nodes, which usually own the whole process, it's totally
> fine to kill the process right away.
>

Well, even here I would still try to gracefully stop the node first. If
that cannot be done, then we should kill the process.
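
A minimal Java sketch of "try a graceful stop first, then kill" might
look like this. The class StopThenKill, the method stopOrKill and the
timeout parameter are assumptions made for illustration; the stop and
kill steps are passed in as plain Runnables so that nothing here depends
on a real Ignite API.

```java
// Sketch: attempt a graceful stop, escalate if it does not finish in time.
// Hypothetical helper; not an Ignite API.
final class StopThenKill {
    /**
     * Runs stopNode on a separate thread and waits up to timeoutMillis (must be > 0).
     * Returns true if the stop completed normally; otherwise runs killProcess
     * and returns false.
     */
    static boolean stopOrKill(Runnable stopNode, Runnable killProcess, long timeoutMillis) {
        final boolean[] stopped = {false};

        Thread stopper = new Thread(() -> {
            stopNode.run();
            stopped[0] = true; // not reached if stopNode throws
        }, "node-stopper");

        stopper.setDaemon(true); // a stuck stop must not keep the JVM alive
        stopper.start();

        try {
            stopper.join(timeoutMillis);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }

        if (!stopper.isAlive() && stopped[0])
            return true;

        // The stop hung or failed: escalate.
        // In a real process this could be () -> Runtime.getRuntime().halt(1).
        killProcess.run();
        return false;
    }
}
```

Passing the kill step in as a Runnable keeps the sketch testable; a
caller that really wants the JVM to die immediately would supply a
Runnable that calls Runtime.getRuntime().halt(...), which skips shutdown
hooks.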

Andrey Kornev

Re: IEP-14: Ignite failures handling (Discussion)

I believe the only reasonable way to handle a critical system failure (as it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!). The sooner the better: the impact is smaller. There’s simply no way to reason about the state of the system in a situation like that; all bets are off. Any other policy would only confuse matters and in all likelihood make things worse.

In practice, SREs/Operations would much rather have a process die a quick, clean death than let it run indefinitely and hope that it’ll somehow recover by itself at some point in the future, potentially degrading the overall system stability and availability all the while.

Andrey
_____________________________
From: Dmitriy Setrakyan <[hidden email]>
Sent: Monday, March 12, 2018 5:23 PM
Subject: Re: IEP-14: Ignite failures handling (Discussion)
To: <[hidden email]>

On Mon, Mar 12, 2018 at 5:12 PM, Denis Magda <[hidden email]> wrote:

> Dmitriy,
>
> Ignite client node is usually used in the embedded mode. By killing the
> whole process, the node is running in, we're going to kill the entire
> application. That doesn't sound like a good plan. That's why my suggestion
> is to try to kill the node somehow instead rather than the whole process.
>

Agree. However, if the node cannot stop gracefully, we should kill the
process anyway. This should be the default behavior. User should be able to
turn it off as needed.

>
> As for the server nodes, which usually own the whole process, it's totally
> fine to kill the process right away.
>

Well, even here I would still try to gracefully stop the node first. If
that cannot be done, then we should kill the process.

dsetrakyan

Re: IEP-14: Ignite failures handling (Discussion)

On Tue, Mar 13, 2018 at 1:18 AM, Andrey Kornev <[hidden email]>
wrote:

> I believe the only reasonable way to handle a critical system failure (as
> it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!).
> The sooner - the better, lesser impact. There’s simply no way to reason
> about the state of the system in a situation like that, all bets are off.
> Any other policy would only confuse the matters and in all likelihood make
> things worse.
>
> In practice, SREs/Operations would very much rather have a process die a
> quick clean death, than let it run indefinitely and hope that it’ll somehow
> recover by itself at some point in future, potentially degrading the
> overall system stability and availability all the while.
>

Completely agree.

Dmitriy Pavlov

Re: IEP-14: Ignite failures handling (Discussion)

Denis, Dmitriy, I am not sure I agree here. Consider a close analogue: the
JVM itself and its ExitOnOutOfMemoryError parameter, which is not enabled
by default.

If a server node is started from the sh script, kill is OK for me, as the
process is controlled only by Ignite. It is sufficient to add an option
that overrides the default for the sh script.

Users interested in this behaviour may also set this option to "kill".

If a server node is started from Java, it should never kill the whole
process. This mode is not prohibited by the docs: users are allowed to
start several nodes in one process and run their own application logic
in this node.

Why should we kill running user code? It could be an unpleasant surprise
for the user.

Tue, Mar 13, 2018 at 8:26, Dmitriy Setrakyan <[hidden email]>:

> On Tue, Mar 13, 2018 at 1:18 AM, Andrey Kornev <[hidden email]>
> wrote:
>
> > I believe the only reasonable way to handle a critical system failure (as
> > it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!).
> > The sooner - the better, lesser impact. There’s simply no way to reason
> > about the state of the system in a situation like that, all bets are off.
> > Any other policy would only confuse the matters and in all likelihood
> > make things worse.
> >
> > In practice, SREs/Operations would very much rather have a process die a
> > quick clean death, than let it run indefinitely and hope that it’ll
> > somehow recover by itself at some point in future, potentially degrading the
> > overall system stability and availability all the while.
> >
>
> Completely agree.
>

Vladimir Ozerov

Re: IEP-14: Ignite failures handling (Discussion)

+1 for "kill if standalone, stop if embedded". We should never kill a
process that hosts an embedded node, because it might be disastrous for
the user application.

On Tue, Mar 13, 2018 at 10:41 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Denis, Dmitriy, I am not sure I agree here, please see close analogue - JVM
> itself, and its parameter ExitOnOutOfMemoryError,- it is not default.
>
> If server node is started from sh script, kill OK for me, as process is
> controlled only by ignite. It is sufficient to add option to override
> default for sh script.
>
> Users interested in this behaviour may also setup this option to "kill"
>
> If server node is started from java, it should never kill whole process.
> This mode is not prohibited by docs, users are allowed to start several
> nodes in one process, run its own application logic in this node.
>
> Why we should kill user code running? It could be negative surprise to
> user.

Alexey Goncharuk

Re: IEP-14: Ignite failures handling (Discussion)

I also like "kill if standalone, stop if embedded" by default. A user can
change it to kill for embedded mode, but it will be a controlled, safe
choice.

2018-03-13 11:26 GMT+03:00 Vladimir Ozerov <[hidden email]>:

> +1 for "kill if standalone, stop if embedded". We should never kill a
> process in embedded node because it might be disastrous for user
> application.

dsetrakyan

Re: IEP-14: Ignite failures handling (Discussion)

Guys, I do not understand the alternative. If Ignite is frozen and causes
the whole grid to freeze, how can we justify not killing it? Would users
rather have their applications freeze?

I would consider real-life use cases here. Can someone present a real-life
example where keeping a frozen grid node around is better than killing the
JVM?

D.

On Tue, Mar 13, 2018 at 6:16 AM, Alexey Goncharuk <
[hidden email]> wrote:

> I also like "kill if standalone, stop if embedded" by default. A use can
> change it to kill for embedded mode, but it will be a controlled safe
> choice.

Dmitriy Pavlov

Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy, the alternative is "kill if standalone, stop if embedded".

The user will still be able to set something like
-DNODE_CRASH_ACTION="kill"
if ignite.sh is not used and the user accepts that the whole process
will be killed if the node crashes.

The default would be 'node stop', rather than hanging indefinitely.
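
To make the idea concrete, here is a small Java sketch of a default that
depends on the launch mode, with an explicit override. The property name
NODE_CRASH_ACTION comes from the example above; the class
CrashActionConfig, its methods and the "kill"/"stop" values are
assumptions for illustration only.

```java
// Sketch: "kill if standalone, stop if embedded", overridable by a system property.
// Hypothetical helper; not an Ignite API.
final class CrashActionConfig {
    /**
     * @param standalone true when the node was started from ignite.sh/bat.
     * @param override   value of -DNODE_CRASH_ACTION, or null when it is not set.
     * @return "kill" or "stop".
     */
    static String resolve(boolean standalone, String override) {
        // An explicit, recognised setting always wins.
        if ("kill".equals(override) || "stop".equals(override))
            return override;

        // Otherwise fall back to the mode-dependent default.
        return standalone ? "kill" : "stop";
    }

    /** Reads the override from the JVM system properties. */
    static String resolveFromSystemProperty(boolean standalone) {
        return resolve(standalone, System.getProperty("NODE_CRASH_ACTION"));
    }
}
```

Unrecognised values fall back to the default here; rejecting them with
an error would be an equally reasonable choice.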

Sincerely,
Dmitriy Pavlov

Tue, Mar 13, 2018 at 14:53, Dmitriy Setrakyan <[hidden email]>:

> Guys, I do not understand the alternative. If Ignite is frozen and causes
> the whole grid to freeze, how can we justify not killing it? Will uses
> rather have their applications freeze?
>
> I would consider real life use cases here. Can someone present a life
> example where keeping a frozen grid node around is better than killing JVM?
>
> D.

Andrey Kuznetsov

Re: IEP-14: Ignite failures handling (Discussion)

The most doubtful thing is 'stopping'. What if the node does not respond due
to a critical failure?

2018-03-13 15:16 GMT+03:00 Dmitry Pavlov <[hidden email]>:

> Dmitriy, the alternative is "kill if standalone, stop if embedded".
>
> The user will still be able to set something like
> -DNODE_CRASH_ACTION="kill"
> if ignite.sh is not used and the user accepts that the whole process
> will be killed if the node crashes.
>
> The default would be 'node stop', but without hanging indefinitely.
>
> Sincerely,
> Dmitriy Pavlov

Best regards,
Andrey Kuznetsov.

dsetrakyan

Re: IEP-14: Ignite failures handling (Discussion)

In reply to this post by Dmitriy Pavlov

On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Dmitriy, the alternative is "kill if standalone, stop if embedded".
>
> The user will still be able to set something like
> -DNODE_CRASH_ACTION="kill"
> if ignite.sh is not used and the user accepts that the whole process
> will be killed if the node crashes.
>
> The default would be 'node stop', but without hanging indefinitely.
>

Dmitriy, if Ignite is frozen, you will not be able to stop it. The only
guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.

On top of that, it is very likely that if you stop the "embedded" Ignite,
the user application will not be able to function anyway, so killing the
node does sound like a better and *safer* option.
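
One way to combine "stop the node" with the concern that a frozen node cannot
be stopped is a bounded stop attempt that escalates when it does not finish in
time. The following is a minimal sketch under stated assumptions, not an Ignite
API: stopOrEscalate, stopNode and escalate are hypothetical names. The
escalation step is injected as a Runnable so that the sketch can be exercised
without terminating the JVM; in a real process it could be
() -> Runtime.getRuntime().halt(1).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StopWithEscalationSketch {
    /**
     * Tries to stop the node within the given timeout and escalates otherwise.
     *
     * @param stopNode  hypothetical graceful-stop logic
     * @param escalate  fallback to run if the stop hangs or fails (e.g. a JVM halt)
     * @param timeoutMs how long to wait for the graceful stop
     * @return true if the node stopped in time, false if escalation was triggered
     */
    static boolean stopOrEscalate(Runnable stopNode, Runnable escalate, long timeoutMs) {
        // Run the stop on a separate daemon thread so a hung stop cannot block the caller.
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true);
            return t;
        });
        Future<?> stop = exec.submit(stopNode);
        try {
            stop.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // The stop hung or threw: fall back to the more severe action.
            escalate.run();
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            escalate.run();
            return false;
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean stopped = stopOrEscalate(
            () -> { /* graceful node stop would go here */ },
            () -> Runtime.getRuntime().halt(1),
            5_000);
        System.out.println("Stopped gracefully: " + stopped);
    }
}
```

Runtime.halt is used in the example because, unlike System.exit, it does not
run shutdown hooks, which may themselves hang after a critical failure.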

D.

Dmitriy Pavlov

Re: IEP-14: Ignite failures handling (Discussion)

Please consider that a user application may use Ignite as an optional cache
for some low-priority feature, while its main logic functions well without
Ignite. I can say, as a past Ignite user, that this is quite a real case.

A second real case is running several WAR files with different logic within
one application server. Some apps use Ignite, some do not. Killing the
application server is not an option in this case either.

So the default should be to stop all node threads, but not to kill the
process. If the user is aware that the process may be killed, they can set
the option.

On Tue, Mar 13, 2018 at 15:24, Dmitriy Setrakyan <[hidden email]> wrote:

> On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov <[hidden email]>
> wrote:
>
> > Dmitriy, the alternative is "kill if standalone, stop if embedded".
> >
> > The user will still be able to set something like
> > -DNODE_CRASH_ACTION="kill"
> > if ignite.sh is not used and the user accepts that the whole process
> > will be killed if the node crashes.
> >
> > The default would be 'node stop', but without hanging indefinitely.
> >
>
> Dmitriy, if Ignite is frozen, you will not be able to stop it. The only
> guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.
>
> On top of that, it is very likely that if you stop the "embedded" Ignite,
> the user application will not be able to function anyway, so killing the
> node does sound like a better and *safer* option.
>
> D.
>

dsetrakyan

Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy,

I think everyone is suggesting that stopping the node will likely be
impossible if Ignite is frozen. Moreover, it is very likely that all other
apps are frozen too.

My comments are below...

On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Please consider that a user application may use Ignite as an optional cache
> for some low-priority feature, while its main logic functions well without
> Ignite. I can say, as a past Ignite user, that this is quite a real case.
>

I have been a part of this project for a while, but I have never seen
Ignite used as an optional cache. Usually, Ignite is a mandatory part of
the application, not optional.

> A second real case is running several WAR files with different logic within
> one application server. Some apps use Ignite, some do not. Killing the
> application server is not an option in this case either.
>

Not very likely, but possible. This is not a common use case. Most commonly
Ignite would be serving all WAR files with a common data layer.

>
> So the default should be to stop all node threads, but not to kill the
> process. If the user is aware that the process may be killed, they can set
> the option.
>

No, the default should be to kill the process. If the user does not like it,
then it should be possible to change it to stop the node first.

Dmitriy Pavlov

Re: IEP-14: Ignite failures handling (Discussion)

You are suggesting killing a process that was not started by Ignite,
aren't you?

It is more consistent to stop only those processes that were started under
Ignite's control, e.g. from ignite.sh; that is OK for me.

If we release 'kill by default' as part of 2.5, we will end up with an
emergency 2.6 release to change it back if even one user runs into such
unexpected behaviour.

On Tue, Mar 13, 2018 at 18:17, Dmitriy Setrakyan <[hidden email]> wrote:

> Dmitriy,
>
> I think everyone is suggesting that stopping the node will likely be
> impossible if Ignite is frozen. Moreover, it is very likely that all other
> apps are frozen too.
>
> My comments are below...
>
> On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov <[hidden email]>
> wrote:
>
> > Please consider that a user application may use Ignite as an optional
> > cache for some low-priority feature, while its main logic functions well
> > without Ignite. I can say, as a past Ignite user, that this is quite a
> > real case.
> >
>
> I have been a part of this project for a while, but I have never seen
> Ignite used as an optional cache. Usually, Ignite is a mandatory part of
> the application, not optional.
>
>
> > A second real case is running several WAR files with different logic
> > within one application server. Some apps use Ignite, some do not.
> > Killing the application server is not an option in this case either.
> >
>
> Not very likely, but possible. This is not a common use case. Most commonly
> Ignite would be serving all WAR files with a common data layer.
>
>
> >
> > So the default should be to stop all node threads, but not to kill the
> > process. If the user is aware that the process may be killed, they can
> > set the option.
> >
>
> No, the default should be to kill the process. If the user does not like it,
> then it should be possible to change it to stop the node first.
>

dmagda

Re: IEP-14: Ignite failures handling (Discussion)

+1 for the "kill if standalone, stop if embedded" behavior. If practice
shows that the node should be killed regardless of the mode, then it will
be an easy change. Right now we are just guessing, and common sense suggests
going for "kill if standalone, stop if embedded" until we get feedback.

-
Denis

On Tue, Mar 13, 2018 at 8:30 AM, Dmitry Pavlov <[hidden email]>
wrote:

> You are suggesting killing a process that was not started by Ignite,
> aren't you?
>
> It is more consistent to stop only those processes that were started under
> Ignite's control, e.g. from ignite.sh; that is OK for me.
>
> If we release 'kill by default' as part of 2.5, we will end up with an
> emergency 2.6 release to change it back if even one user runs into such
> unexpected behaviour.
