IEP-14: Ignite failures handling (Discussion)

agura
Igniters!

We are working on the proposal described in IEP-14 Ignite failures
handling [1], and it's time to discuss it with the community (although
this should have been done earlier).

The most important question: what should the default behaviour be in
case of a failure? There are 4 possible actions:

1. Restart the JVM process (possible only if the process was started
from the ignite.(sh|bat) script);
2. Terminate the JVM;
3. Stop the node (if there is only one node in the process, the process
will also be terminated);
4. No operation.

I believe that the node should be stopped by default. But there is a
chance that the node will not be stopped correctly.

Maybe we should terminate the JVM process by default. But that will
kill all nodes in the JVM process, which is especially bad when the
nodes belong to different Ignite clusters (a real use case).

Maybe we should restart the JVM process by default. This approach has
the same problems as the previous one, and additionally it could lead
to continuous restarts and, therefore, continuous exchanges and
rebalancing.

It's a difficult choice. Could you please share your thoughts?

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
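
For readers following the thread, the four candidate actions in the message above can be written down as a small Java enum. This is only a sketch: the names FailureAction and affectsWholeProcess are hypothetical and are not taken from the IEP or from Ignite's API. It encodes only the blast radius described in the message (which actions take down the whole JVM process).

```java
/**
 * Hypothetical enumeration of the four failure responses listed in the
 * message above. The names are illustrative only; they are not Ignite API.
 */
enum FailureAction {
    RESTART_JVM,   // 1. restart the JVM process (needs the ignite.(sh|bat) launcher)
    TERMINATE_JVM, // 2. terminate the JVM
    STOP_NODE,     // 3. stop only the failed node
    NOOP;          // 4. no operation

    /**
     * Returns true if applying this action takes down the whole JVM process,
     * given how many Ignite nodes currently run in that process.
     */
    boolean affectsWholeProcess(int nodesInProcess) {
        switch (this) {
            case RESTART_JVM:
            case TERMINATE_JVM:
                return true;
            case STOP_NODE:
                // Per the message: stopping the only node also ends the process.
                return nodesInProcess == 1;
            default:
                return false;
        }
    }
}
```

With two nodes from different clusters in one JVM, only STOP_NODE and NOOP leave the other node untouched, which is the trade-off the message is asking about.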

Re: IEP-14: Ignite failures handling (Discussion)

Andrey Kuznetsov
To my mind, since we are dealing with critical errors, the default action
should be as severe as possible, that is, termination of the entire JVM.
For a custom setup (e.g. nodes of different clusters in one JVM), the
failure response action should be configured explicitly.

2018-03-12 12:32 GMT+03:00 Andrey Gura <[hidden email]>:

> Igniters!
>
> We are working on the proposal described in IEP-14 Ignite failures
> handling [1], and it's time to discuss it with the community (although
> this should have been done earlier).
>
> The most important question: what should the default behaviour be in
> case of a failure? There are 4 possible actions:
>
> 1. Restart the JVM process (possible only if the process was started
> from the ignite.(sh|bat) script);
> 2. Terminate the JVM;
> 3. Stop the node (if there is only one node in the process, the process
> will also be terminated);
> 4. No operation.
>
> I believe that the node should be stopped by default. But there is a
> chance that the node will not be stopped correctly.
>
> Maybe we should terminate the JVM process by default. But that will
> kill all nodes in the JVM process, which is especially bad when the
> nodes belong to different Ignite clusters (a real use case).
>
> Maybe we should restart the JVM process by default. This approach has
> the same problems as the previous one, and additionally it could lead
> to continuous restarts and, therefore, continuous exchanges and
> rebalancing.
>
> It's a difficult choice. Could you please share your thoughts?
>
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
>



--
Best regards,
  Andrey Kuznetsov.

Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy Pavlov
Hi Andrey, Igniters,

Thank you for starting this topic; this is a really important decision.

If Ignite is started within an application server alongside other
applications, JVM termination will kill all of the services started there.

So I suggest that this option not be the default. We can pre-configure
this option (action="JVM termination") for ignite.sh/bat, since there we
know it is a separate JVM. But I would not vote for this option being the
default in code.

Sincerely,
Dmitriy Pavlov

Mon, Mar 12, 2018 at 12:57, Andrey Kuznetsov <[hidden email]>:

> To my mind, since we are dealing with critical errors, the default action
> should be as severe as possible, that is, termination of the entire JVM.
> For a custom setup (e.g. nodes of different clusters in one JVM), the
> failure response action should be configured explicitly.

Re: IEP-14: Ignite failures handling (Discussion)

dmagda
Guys,

I would make the decision depending on the type of the problematic node:

   - If it's a *server node*, then let's kill the process, simply because
   the node usually owns the whole process. I don't see a practical reason
   why a user would want to run 2 server nodes in a single process.
   - If it's a *client node*, then the best approach is to kill the node
   and not the process.

--
Denis

On Mon, Mar 12, 2018 at 3:04 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Hi Andrey, Igniters,
>
> Thank you for starting this topic; this is a really important decision.
>
> If Ignite is started within an application server alongside other
> applications, JVM termination will kill all of the services started there.
>
> So I suggest that this option not be the default. We can pre-configure
> this option (action="JVM termination") for ignite.sh/bat, since there we
> know it is a separate JVM. But I would not vote for this option being the
> default in code.
>
> Sincerely,
> Dmitriy Pavlov

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
Denis, what is the difference between killing the process and killing the
node and the process?

D.

On Mon, Mar 12, 2018 at 12:03 PM, Denis Magda <[hidden email]> wrote:

> Guys,
>
> I would make the decision depending on the type of the problematic node:
>
>    - If it's a *server node*, then let's kill the process, simply because
>    the node usually owns the whole process. I don't see a practical reason
>    why a user would want to run 2 server nodes in a single process.
>    - If it's a *client node*, then the best approach is to kill the node
>    and not the process.
>
> --
> Denis

Re: IEP-14: Ignite failures handling (Discussion)

dmagda
Dmitriy,

An Ignite client node is usually used in embedded mode. By killing the
whole process the node is running in, we would kill the entire
application. That doesn't sound like a good plan. That's why my
suggestion is to try to kill the node rather than the whole process.

As for the server nodes, which usually own the whole process, it's totally
fine to kill the process right away.

--
Denis

On Mon, Mar 12, 2018 at 4:12 PM, Dmitriy Setrakyan <[hidden email]>
wrote:

> Denis, what is the difference between killing the process and killing the
> node and the process?
>
> D.

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
On Mon, Mar 12, 2018 at 5:12 PM, Denis Magda <[hidden email]> wrote:

> Dmitriy,
>
> An Ignite client node is usually used in embedded mode. By killing the
> whole process the node is running in, we would kill the entire
> application. That doesn't sound like a good plan. That's why my
> suggestion is to try to kill the node rather than the whole process.
>

Agreed. However, if the node cannot stop gracefully, we should kill the
process anyway. This should be the default behavior. Users should be able
to turn it off as needed.


>
> As for the server nodes, which usually own the whole process, it's totally
> fine to kill the process right away.
>

Well, even here I would still try to gracefully stop the node first. If
that cannot be done, then we should kill the process.
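
The "try a graceful stop first, kill the process if that fails" policy described in this reply can be sketched in plain Java as follows. Everything here is an assumption made for illustration: the class StopOrKill, its method, and the two Runnable parameters are hypothetical and do not come from Ignite. The kill step is passed in as a callback so the sketch can be exercised without actually ending the JVM.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: graceful node stop with a hard fallback. */
final class StopOrKill {
    /**
     * Runs stopNode on a separate daemon thread and waits up to timeoutMs.
     * Returns true if the stop finished in time; otherwise runs killProcess
     * and returns false.
     */
    static boolean stopOrKill(Runnable stopNode, Runnable killProcess, long timeoutMs) {
        ExecutorService stopper = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true); // a hung stop must not keep the JVM alive
            return t;
        });
        Future<?> pending = stopper.submit(stopNode);
        try {
            pending.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        } catch (ExecutionException | TimeoutException e) {
            // The stop failed or did not finish in time: fall through to the kill.
        } finally {
            stopper.shutdownNow(); // interrupts the stopper thread if still running
        }
        killProcess.run();
        return false;
    }
}
```

In a real deployment the fallback might be something like `() -> Runtime.getRuntime().halt(1)`, which ends the JVM without running shutdown hooks; whether halt or a regular exit is the right choice is exactly what the thread goes on to debate.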


>
> --
> Denis

Re: IEP-14: Ignite failures handling (Discussion)

Andrey Kornev
I believe the only reasonable way to handle a critical system failure (as it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!). The sooner the better: the impact is smaller. There’s simply no way to reason about the state of the system in a situation like that; all bets are off. Any other policy would only confuse matters and in all likelihood make things worse.

In practice, SREs/Operations would very much rather have a process die a quick, clean death than let it run indefinitely and hope that it’ll somehow recover by itself at some point in the future, potentially degrading the overall system stability and availability all the while.

Andrey
_____________________________
From: Dmitriy Setrakyan <[hidden email]>
Sent: Monday, March 12, 2018 5:23 PM
Subject: Re: IEP-14: Ignite failures handling (Discussion)
To: <[hidden email]>


On Mon, Mar 12, 2018 at 5:12 PM, Denis Magda <[hidden email]> wrote:

> Dmitriy,
>
> An Ignite client node is usually used in embedded mode. By killing the
> whole process the node is running in, we would kill the entire
> application. That doesn't sound like a good plan. That's why my
> suggestion is to try to kill the node rather than the whole process.
>

Agreed. However, if the node cannot stop gracefully, we should kill the
process anyway. This should be the default behavior. Users should be able
to turn it off as needed.


>
> As for the server nodes, which usually own the whole process, it's totally
> fine to kill the process right away.
>

Well, even here I would still try to gracefully stop the node first. If
that cannot be done, then we should kill the process.


>
> --
> Denis

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
On Tue, Mar 13, 2018 at 1:18 AM, Andrey Kornev <[hidden email]>
wrote:

> I believe the only reasonable way to handle a critical system failure (as
> it is defined in the IEP) is a JVM halt (not a graceful exit/shutdown!).
> The sooner the better: the impact is smaller. There’s simply no way to
> reason about the state of the system in a situation like that; all bets
> are off. Any other policy would only confuse matters and in all
> likelihood make things worse.
>
> In practice, SREs/Operations would very much rather have a process die a
> quick, clean death than let it run indefinitely and hope that it’ll
> somehow recover by itself at some point in the future, potentially
> degrading the overall system stability and availability all the while.
>

Completely agree.

Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy Pavlov
Denis, Dmitriy, I am not sure I agree here. Consider a close analogue:
the JVM itself and its ExitOnOutOfMemoryError parameter, which is not
enabled by default.

If a server node is started from the sh script, killing is OK for me, as
the process is controlled only by Ignite. It is sufficient to add an
option that overrides the default for the sh script.

Users interested in this behaviour may also set this option to "kill".

If a server node is started from Java, it should never kill the whole
process. This mode is not prohibited by the docs: users are allowed to
start several nodes in one process and to run their own application logic
in this node.

Why should we kill running user code? It could be an unpleasant surprise
for the user.



Tue, Mar 13, 2018 at 8:26, Dmitriy Setrakyan <[hidden email]>:

> Completely agree.
>

Re: IEP-14: Ignite failures handling (Discussion)

Vladimir Ozerov
+1 for "kill if standalone, stop if embedded". We should never kill the
process of an embedded node, because it might be disastrous for the user
application.

On Tue, Mar 13, 2018 at 10:41 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Denis, Dmitriy, I am not sure I agree here. Consider a close analogue:
> the JVM itself and its ExitOnOutOfMemoryError parameter, which is not
> enabled by default.
>
> If a server node is started from the sh script, killing is OK for me, as
> the process is controlled only by Ignite. It is sufficient to add an
> option that overrides the default for the sh script.
>
> Users interested in this behaviour may also set this option to "kill".
>
> If a server node is started from Java, it should never kill the whole
> process. This mode is not prohibited by the docs: users are allowed to
> start several nodes in one process and to run their own application logic
> in this node.
>
> Why should we kill running user code? It could be an unpleasant surprise
> for the user.

Re: IEP-14: Ignite failures handling (Discussion)

Alexey Goncharuk
I also like "kill if standalone, stop if embedded" by default. A user can
change it to kill for embedded mode, but it will be a controlled, safe
choice.
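
The "kill if standalone, stop if embedded" default with an explicit user override, as proposed in this part of the thread, could look roughly like the sketch below. The class DefaultFailureAction, its nested enums, and the resolve method are hypothetical names chosen for illustration; they are not Ignite configuration or API.

```java
import java.util.Optional;

/** Hypothetical resolver for the default failure response. */
final class DefaultFailureAction {
    enum Action { KILL_PROCESS, STOP_NODE }

    enum LaunchMode { STANDALONE, EMBEDDED }

    /**
     * Picks the action for a failed node: an explicit user setting always
     * wins; otherwise standalone nodes kill the process and embedded nodes
     * are only stopped.
     */
    static Action resolve(LaunchMode mode, Optional<Action> userOverride) {
        Action byMode = (mode == LaunchMode.STANDALONE)
            ? Action.KILL_PROCESS
            : Action.STOP_NODE;
        return userOverride.orElse(byMode);
    }
}
```

For example, `resolve(LaunchMode.EMBEDDED, Optional.of(Action.KILL_PROCESS))` yields KILL_PROCESS: the embedding application is only killed when its owner has opted in.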

2018-03-13 11:26 GMT+03:00 Vladimir Ozerov <[hidden email]>:

> +1 for "kill if standalone, stop if embedded". We should never kill the
> process of an embedded node, because it might be disastrous for the user
> application.
>
> On Tue, Mar 13, 2018 at 10:41 AM, Dmitry Pavlov <[hidden email]>
> wrote:
>
> > Denis, Dmitriy, I am not sure I agree here, please see close analogue -
> JVM
> > itself, and its parameter ExitOnOutOfMemoryError,- it is not default.
> >
> > If server node is started from sh script, kill OK for me, as process is
> > controlled only by ignite.  It is sufficient to add option to override
> > default for sh script.
> >
> > Users interested in this behaviour may also setup this option to "kill"
> >
> > If server node is started from java, it should never kill whole process.
> > This mode is not prohibited by docs, users are allowed to start several
> > nodes in one process, run its own application logic in this node.
> >
> > Why we should kill user code running? It could be negative surprise to
> > user.
> >
> >
> >
> > вт, 13 мар. 2018 г. в 8:26, Dmitriy Setrakyan <[hidden email]>:
> >
> > > On Tue, Mar 13, 2018 at 1:18 AM, Andrey Kornev <
> [hidden email]
> > >
> > > wrote:
> > >
> > > > I believe the only reasonable way to handle a critical system failure
> > (as
> > > > it is defined in the IEP) is a JVM halt (not a graceful
> > exit/shutdown!).
> > > > The sooner - the better, lesser impact. There’s simply no way to
> reason
> > > > about the state of the system in a situation like that, all bets are
> > off.
> > > > Any other policy would only confuse the matters and in all likelihood
> > > make
> > > > things worse.
> > > >
> > > > In practice, SREs/Operations would very much rather have a process
> die
> > a
> > > > quick clean death, than let it run indefinitely and hope that it’ll
> > > somehow
> > > > recover by itself at some point in future, potentially degrading the
> > > > overall system stability and availability all the while.
> > > >
> > >
> > > Completely agree.
> > >
> >
>

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
Guys, I do not understand the alternative. If Ignite is frozen and causes
the whole grid to freeze, how can we justify not killing it? Would users
rather have their applications freeze?

I would consider real-life use cases here. Can someone present a real-life
example where keeping a frozen grid node around is better than killing the JVM?

D.


Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy Pavlov
Dmitriy, the alternative is "kill if standalone, stop if embedded".

The user will still be able to set something like
-DNODE_CRASH_ACTION="kill"
if ignite.sh is not used and the user accepts that the whole process
would be killed if the node crashes.

The default would be 'node stop', but not hanging indefinitely.

Sincerely,
Dmitriy Pavlov
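
To make the "kill if standalone, stop if embedded" default with an explicit override concrete, here is a minimal sketch in Java. It is only an illustration under assumptions: the NODE_CRASH_ACTION property name is the "something like" placeholder from the message above, and the class, enum, and `standalone` flag are hypothetical names that do not exist in Ignite.

```java
import java.util.Locale;

/** Hypothetical sketch; none of these names are part of the Ignite API. */
final class CrashActionResolver {
    /** Reactions to a critical node failure discussed in this thread. */
    enum CrashAction { KILL, STOP }

    /** Placeholder property name taken from the message above (an assumption). */
    static final String OVERRIDE_PROPERTY = "NODE_CRASH_ACTION";

    private CrashActionResolver() { }

    /**
     * @param standalone true if the node owns the process (e.g. started from ignite.sh).
     * @param override   raw value of the override property, or null if it is not set.
     * @return the explicit override if it names a known action, else the mode default.
     */
    static CrashAction resolve(boolean standalone, String override) {
        // "Kill if standalone, stop if embedded".
        CrashAction modeDefault = standalone ? CrashAction.KILL : CrashAction.STOP;

        if (override == null)
            return modeDefault;

        // Tolerate surrounding quotes and any letter case, e.g. "kill" or KILL.
        String value = override.replace("\"", "").trim().toUpperCase(Locale.ROOT);

        for (CrashAction action : CrashAction.values()) {
            if (action.name().equals(value))
                return action;
        }

        // Unknown values fall back to the mode default rather than failing at startup.
        return modeDefault;
    }

    /** Convenience overload that reads the override from the JVM system properties. */
    static CrashAction resolve(boolean standalone) {
        return resolve(standalone, System.getProperty(OVERRIDE_PROPERTY));
    }
}
```

With this sketch an embedded node resolves to STOP unless the JVM is started with -DNODE_CRASH_ACTION=kill, which matches the "controlled choice" described in this thread.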


Re: IEP-14: Ignite failures handling (Discussion)

Andrey Kuznetsov
The most doubtful thing is 'stopping'. What if the node does not respond
because of the critical failure?

Best regards,
  Andrey Kuznetsov.
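
One way to address the "what if the stop itself hangs" concern is to bound the stop attempt with a timeout and escalate when it does not finish. The sketch below is a hedged illustration only: the class and method names are hypothetical and not Ignite API, and the stop and fallback actions are supplied by the caller.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; not part of the Ignite API. */
final class StopWithTimeout {
    private StopWithTimeout() { }

    /**
     * Runs the stop action on a separate daemon thread and waits up to timeoutMs.
     *
     * @return true if the stop finished in time; false if the fallback was run instead.
     */
    static boolean stopOrEscalate(Runnable stopAction, long timeoutMs, Runnable fallback) {
        // A daemon thread, so a stuck stop attempt cannot keep the JVM alive by itself.
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true);
            return t;
        });

        Future<?> stop = exec.submit(stopAction);

        try {
            stop.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // The stop hung or failed: give up on it and escalate.
            stop.cancel(true);
            fallback.run();
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for the caller, then escalate.
            Thread.currentThread().interrupt();
            stop.cancel(true);
            fallback.run();
            return false;
        } finally {
            exec.shutdownNow();
        }
    }
}
```

In an embedded setup the stop action could wrap a call such as Ignition.stop(name, true), and the fallback could be () -> Runtime.getRuntime().halt(1), which terminates the JVM without running shutdown hooks. Whether that escalation is acceptable for an embedded node is exactly the question under discussion here.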

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
In reply to this post by Dmitriy Pavlov
On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Dmitriy, the alternative is "kill if standalone, stop if embedded".
>
> The user will still be able to set something like
> -DNODE_CRASH_ACTION="kill"
> if ignite.sh is not used and the user accepts that the whole process
> would be killed if the node crashes.
>
> The default would be 'node stop', but not hanging indefinitely.
>

Dmitriy, if Ignite is frozen, you will not be able to stop it. The only
guaranteed way to "un-freeze" the cluster is to kill the frozen JVM.

On top of that, it is very likely that if you stop the "embedded" Ignite,
the user application will not be able to function anyway, so killing the
node does sound like a better and *safer* option.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy Pavlov
Please consider that a user application may use Ignite as an optional cache for
some low-priority feature, while the main logic functions well without
Ignite. I can say, as a past Ignite user, that this is quite a real case.

A second real case is several WAR files deployed in one application server,
running different logic. Some applications use Ignite, some do not.
Killing the application server is not an option in this case either.

So the default should be to stop all node threads, but not to kill the process.
If the user is aware that the process may be killed, they can set the option.


Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
Dmitriy,

I think everyone is suggesting that stopping the node will likely be
impossible if Ignite is frozen. Moreover, it is very likely that all other
apps are frozen too.

My comments are below...

On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Please consider that a user application may use Ignite as an optional cache for
> some low-priority feature, while the main logic functions well without
> Ignite. I can say, as a past Ignite user, that this is quite a real case.
>

I have been a part of this project for a while, but I have never seen
Ignite used as an optional cache. Usually, Ignite is a mandatory part of
the application, not optional.


> A second real case is several WAR files deployed in one application server,
> running different logic. Some applications use Ignite, some do not.
> Killing the application server is not an option in this case either.
>

Not very likely, but possible. This is not a common use case. Most commonly
Ignite would be serving all WAR files with a common data layer.


>
> So the default should be to stop all node threads, but not to kill the process.
> If the user is aware that the process may be killed, they can set the option.
>

No, the default should be to kill the process. If user does not like it,
then it should be possible to change it to stop the node first.



Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy Pavlov
You are suggesting killing a process that was not started by Ignite,
aren't you?

It is more consistent to kill only those processes that are started under
Ignite's control, e.g. from ignite.sh; that is fine with me.

If we release 'kill by default' as part of 2.5, we will end up with an
emergency 2.6 release to change it back as soon as one user runs into this
unexpected behaviour.


Re: IEP-14: Ignite failures handling (Discussion)

dmagda
+1 for "kill if standalone, stop if embedded" behavior. If practice
shows that the node should be killed regardless of the mode, then it will
be an easy change. Right now we are just guessing, and common sense suggests
going for "kill if standalone, stop if embedded" until we get feedback.

-
Denis
