IEP-14: Ignite failures handling (Discussion)

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
Guys, I do not think there is an understanding here. If Ignite hangs - it
will likely be impossible to stop. So if you are suggesting "stop if
embedded", you might as well suggest "do nothing if embedded".

I have seen many Ignite deployments, embedded or not, large and small, and
in all those deployments if Ignite went into a frozen state, killing it was
the best option. Moreover, it provided the most predictable behavior. I am
not guessing here, but it seems to me that the rest of the community is
guessing.

Killing a frozen Ignite node should be the default behavior in all cases,
embedded or not. Stopping a frozen Ignite node should be a configurable
option, so that a user can turn off the auto-kill behavior. We should
also have a third option, "stop+kill": if stopping fails, the process
is automatically killed (this is also a good default option).

Personally, I am OK if the default behavior is "kill" or "stop+kill", but
it should be the same default in all cases. We should stop the practice of
creating different default behaviors for the same problem. It is confusing
and hard to document.

D.

On Tue, Mar 13, 2018 at 2:19 PM, Denis Magda <[hidden email]> wrote:

> +1 for "kill if standalone, stop if embedded" behavior. If practice
> shows that the node should be killed regardless of the mode, then it will
> be an easy change. Now we are just guessing, and common sense suggests
> going for "kill if standalone, stop if embedded" until we get feedback.
>
> -
> Denis
>
> On Tue, Mar 13, 2018 at 8:30 AM, Dmitry Pavlov <[hidden email]> wrote:
>
> > You are suggesting killing a process that was not started by Ignite,
> > aren't you?
> >
> > It is more consistent to stop only those processes that are started
> > under Ignite's control, e.g. from ignite.sh - that is OK for me.
> >
> > If we release 'kill by default' as part of 2.5, we will end up with a
> > 2.6 emergency release to change it back as soon as one user faces such
> > unexpected behaviour.
> >
> > On Tue, Mar 13, 2018 at 18:17, Dmitriy Setrakyan <[hidden email]> wrote:
> >
> > > Dmitriy,
> > >
> > > I think everyone is suggesting that stopping the node will likely be
> > > impossible if Ignite is frozen. Moreover, it is very likely that all
> > > other apps are frozen too.
> > >
> > > My comments are below...
> > >
> > > On Tue, Mar 13, 2018 at 9:12 AM, Dmitry Pavlov <[hidden email]> wrote:
> > >
> > > > Please consider that a user application may use Ignite as an
> > > > optional cache for some low-priority feature, while the main logic
> > > > functions well without Ignite. I can say, as a past Ignite user,
> > > > that this is quite a real case.
> > > >
> > >
> > > I have been a part of this project for a while, but I have never seen
> > > Ignite used as an optional cache. Usually, Ignite is a mandatory part
> > > of the application, not optional.
> > >
> > > > The second real case is several WAR files within one application
> > > > server, running different logic. Some apps use Ignite, some do not.
> > > > Killing the application server is not an option in this case either.
> > > >
> > >
> > > Not very likely, but possible. This is not a common use case. Most
> > > commonly, Ignite would be serving all WAR files with a common data
> > > layer.
> > >
> > > > So the default should be stopping all node threads, but not killing
> > > > the process. If the user is aware that the process may be killed,
> > > > they can set the option.
> > > >
> > >
> > > No, the default should be to kill the process. If the user does not
> > > like it, then it should be possible to change it to stop the node
> > > first.
> > >
> > > > On Tue, Mar 13, 2018 at 15:24, Dmitriy Setrakyan <[hidden email]> wrote:
> > > >
> > > > > On Tue, Mar 13, 2018 at 8:16 AM, Dmitry Pavlov <[hidden email]> wrote:
> > > > >
> > > > > > Dmitriy, the alternative is "kill if standalone, stop if
> > > > > > embedded".
> > > > > >
> > > > > > The user will still be able to set something like
> > > > > > -DNODE_CRASH_ACTION="kill"
> > > > > > if ignite.sh is not used and the user accepts that the whole
> > > > > > process will be killed if the node crashes.
> > > > > >
> > > > > > The default would be 'node stop', but without hanging
> > > > > > indefinitely.
> > > > > >
> > > > >
> > > > > Dmitriy, if Ignite is frozen, you will not be able to stop it. The
> > > > > only guaranteed way to "un-freeze" the cluster is to kill the
> > > > > frozen JVM.
> > > > >
> > > > > On top of that, it is very likely that if you stop the "embedded"
> > > > > Ignite, the user application will not be able to function anyway,
> > > > > so killing the node does sound like a better and *safer* option.
> > > > >
> > > > > D.

Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy Pavlov
What do you think about making stop the default for all cases?

Kill would be configurable.

We can consider forcing sockets to close on 'stop'. This would allow the
rest of the cluster to ignore the hung node.
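
As a rough illustration of "stop by default, kill only when configured", here is a minimal Java sketch. It reuses the NODE_CRASH_ACTION property name floated earlier in the thread; the class and method names are hypothetical, and nothing here is an existing Ignite API.

```java
import java.util.Locale;

/** Hypothetical helper: resolves the crash action from a JVM system property. */
final class CrashActionConfig {
    /** Property name taken from the thread's -DNODE_CRASH_ACTION="kill" example. */
    static final String PROP = "NODE_CRASH_ACTION";

    /** Returns "kill" only when explicitly requested; anything else falls back to "stop". */
    static String resolve(String raw) {
        if (raw == null) {
            return "stop"; // proposed default: stop the node, leave the JVM alone
        }
        String v = raw.trim().toLowerCase(Locale.ROOT);
        return v.equals("kill") ? "kill" : "stop";
    }

    /** Reads the property as set on the command line, e.g. -DNODE_CRASH_ACTION=kill. */
    static String fromSystemProperty() {
        return resolve(System.getProperty(PROP));
    }
}
```

Falling back to "stop" for unknown values is a design choice of this sketch, so that a typo in the property never escalates to killing the process.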


Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
On Tue, Mar 13, 2018 at 6:55 PM, Dmitry Pavlov <[hidden email]> wrote:

> What do you think about making stop the default for all cases?
>
> Kill would be configurable.
>
> We can consider forcing sockets to close on 'stop'. This would allow the
> rest of the cluster to ignore the hung node.

Dmitriy, I see that you cannot come to terms with stopping a process that
was not started by Ignite. However, in the majority of deployments, users
would prefer that you "kill" the process instead of leaving it
running in a "frozen" state. A frozen state is non-deterministic, and it is
impossible to create a recovery procedure for it. Killing the process is very
deterministic, and in most cases recovery is just a restart.

"stop" does not fix the problem we are trying to solve. The whole point is
to prevent frozen state, and "stop" without "kill" does not prevent it. I
am OK if "stop+kill" is the default behavior, which means that we try a
graceful shutdown and then always kill the process anyway.

I think we should have the following configurable options:
- "stop+kill" (default)
- "kill"
- "stop"
- "stop+restart" (if stop fails, we should kill regardless)
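
A minimal Java sketch of how these four options could be dispatched, assuming the graceful stop runs under a timeout so that a hung stop cannot block the kill. FailureAction, FailurePolicy, and the injected callbacks are hypothetical names used for illustration only, not Ignite classes; in a real deployment the kill callback might be something like `() -> Runtime.getRuntime().halt(1)`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical names mirroring the options listed above. */
enum FailureAction { STOP_KILL, KILL, STOP, STOP_RESTART }

final class FailurePolicy {
    private final Runnable stopNode;    // graceful node stop; may hang
    private final Runnable killProcess; // hard kill of the JVM (assumed to be supplied by the caller)
    private final Runnable restartNode; // restart after a successful stop
    private final long stopTimeoutMs;

    FailurePolicy(Runnable stopNode, Runnable killProcess, Runnable restartNode, long stopTimeoutMs) {
        this.stopNode = stopNode;
        this.killProcess = killProcess;
        this.restartNode = restartNode;
        this.stopTimeoutMs = stopTimeoutMs;
    }

    /** Runs the graceful stop on a daemon thread; returns true only if it finished in time. */
    private boolean tryStop() {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "node-stopper");
            t.setDaemon(true); // a hung stop must not keep the JVM alive
            return t;
        });
        Future<?> f = exec.submit(stopNode);
        try {
            f.get(stopTimeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            f.cancel(true); // best effort: interrupt the hung stop
            return false;
        } catch (ExecutionException e) {
            return false;   // the stop itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            exec.shutdownNow();
        }
    }

    void handle(FailureAction action) {
        switch (action) {
            case KILL:
                killProcess.run();
                break;
            case STOP:
                tryStop(); // no kill, even if the stop hangs
                break;
            case STOP_KILL:
                tryStop();
                killProcess.run(); // kill regardless of the stop outcome
                break;
            case STOP_RESTART:
                if (tryStop()) {
                    restartNode.run();
                } else {
                    killProcess.run(); // stop failed: kill regardless
                }
                break;
        }
    }
}
```

Injecting the stop, kill, and restart actions as callbacks keeps the policy testable without actually halting a JVM.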

Re: IEP-14: Ignite failures handling (Discussion)

Ivan Rakov
In reply to this post by dsetrakyan
I would just like to add my +1 for the "kill if standalone, stop if
embedded" default option. My arguments:

1) Regarding "If Ignite hangs - it will likely be impossible to stop":
Unfortunately, it's true that Ignite can hang during the stop procedure.
However, most of the failures described under IEP-14 (storage IO exceptions,
death of a critical system worker thread, etc.) normally shouldn't put the
node into an "impossible to stop" state. Getting into that state is a bug
in itself. I don't think we should choose system behavior on the basis of
known bugs.

2) A user might want to handle an Ignite node crash before shutting down the
whole JVM - raise an alert, close external resources, etc.

3) The IEP-14 document has important notes: "More than one Ignite node could
be started in one JVM process" and "Different nodes in one JVM process
could belong to different clusters". This is possible only in embedded
mode. I think we shouldn't shock the user with a sudden JVM halt (possibly
along with other healthy nodes) if there's a chance of a successful node
stop.

Best Regards,
Ivan Rakov


Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov <[hidden email]> wrote:

> I just would like to add my +1 for "kill if standalone, stop if embedded"
> default option. My arguments:
>
> 1) Regarding "If Ignite hangs - it will likely be impossible to stop":
> Unfortunately, it's true that Ignite can hang during stop procedure.
> However, most of failures described under IEP-14 (storage IO exceptions,
> death of critical system worker thread, etc) normally shouldn't turn node
> into "impossible to stop" state. Turning into that state is a bug itself. I
> guess that we shouldn't choose system behavior on the basis of known bugs.


The whole discussion is about protecting against force majeure issues,
including Ignite bugs. You are assuming that a user application will
somehow continue to function if an Ignite node is stopped. In most cases it
will just freeze itself and cause the rest of the application to hang.

Again, "stop+kill" is the most deterministic and the safest default
behavior. Try a graceful shutdown (which will make restart easier), and
then kill the process regardless.

Note that we are arguing about the default behavior. If a user does not
like this default, then this user can change it to another behavior.


> 2) User might want to handle Ignite node crash before shutting down the
> whole JVM - raise alert, close external resources, etc
>

Very unlikely, but if a user is this advanced, then this user can change
the default behavior. Most users will not even know how to configure such
custom shutdown behavior and would prefer an automatic kill.

> 3) IEP-14 document has important notes: "More than one Ignite node could be
> started in one JVM process" and "Different nodes in one JVM process could
> belong to different clusters". This is possible only in embedded mode. I
> think, we shouldn't shock user by sudden JVM halt (possibly, along with
> another healthy nodes) if there's a chance of successful node stop.
>

Has anyone actually seen a real example of that? I have not. This scenario
is extremely unlikely and should not define the default behavior. Again, if
a user is advanced enough to come up with such a sophisticated deployment,
then the same user should be able to set different default behaviors for
different clusters.

Re: IEP-14: Ignite failures handling (Discussion)

Ivan Rakov
One more note: "kill if standalone, stop if embedded" differs from what
you are suggesting ("try graceful, then kill the process regardless") only
in the case when graceful shutdown hangs.
Do we have an understanding of how often graceful shutdown hangs?
Obviously, a *grid hang* is a frequent case, but it shouldn't be confused
with a *graceful shutdown hang*. From my experience, if something goes wrong,
users just prefer to do kill -9 because it's much more reliable and
easy. Probably, in most cases where kill -9 worked, a graceful stop
would have worked as well - we just don't have such statistics.
It may be a bad example, but in our CI tests we intentionally break the grid
in many harsh ways and perform a graceful stop after the test execution,
and it doesn't hang - otherwise we'd see many "Execution timeout" test
suite hangs.

Best Regards,
Ivan Rakov


Re: IEP-14: Ignite failures handling (Discussion)

Valentin Kulichenko
Ivan,

If the grid hangs, graceful shutdown would most likely hang as well. You can
almost never recover from a bad state using graceful procedures.

I agree that we should not create two defaults, especially in this case.
It's not even strictly defined what an embedded node is in Ignite. For
example, if I start it using a custom main class and/or a custom script
instead of ignite.sh, would it be an embedded or a standalone node?

-Val


Re: IEP-14: Ignite failures handling (Discussion)

npordash
I can tell you as a user that any library in my application calling
System.exit without my consent would cause a lot of frustration.

If Ignite enters an unrecoverable state, then I think that is something that
should be observable locally, similar to node segmentation, and then the
application can decide the best course of action.

Of course, if ignite was started as a standalone process do what you think
is best, but don't think you can kill the process without backlash from
users if it's running in embedded mode.

- Nick
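
One way to read "observable locally" is a callback that the embedding application registers and that fires when a node hits a critical failure. The following Java sketch only illustrates that idea; NodeFailureListener and FailureNotifier are hypothetical names, not part of Ignite.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical callback through which the application observes a critical node failure. */
interface NodeFailureListener {
    void onCriticalFailure(String nodeName, Throwable cause);
}

/** Hypothetical registry that fans a failure out to all registered listeners. */
final class FailureNotifier {
    private final List<NodeFailureListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(NodeFailureListener listener) {
        listeners.add(listener);
    }

    /** Notifies every listener; returns how many completed without throwing. */
    int fire(String nodeName, Throwable cause) {
        int succeeded = 0;
        for (NodeFailureListener l : listeners) {
            try {
                l.onCriticalFailure(nodeName, cause);
                succeeded++;
            } catch (RuntimeException e) {
                // A faulty listener must not prevent the others from being notified.
            }
        }
        return succeeded;
    }
}
```

With a hook like this the application, not the library, decides whether to raise an alert, release resources, or exit the process.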


Re: IEP-14: Ignite failures handling (Discussion)

Nikolay Izhikov-2
In reply to this post by dsetrakyan
Dmitriy,

I think you and the other participants in this discussion are talking about different cases.

Maybe it would be useful to look at specific cases and discuss each of them separately?

Looking at the IEP page, I see the following:

```
File IO errors. Usually IOException's threw by read/write operations on file system. The following subsystems should be considered as critical:
* WAL
* Page store
* Meta store
* Binary meta store
```

Suppose we run out of disk space on some node.
Everything else is fine.
Should we do `System.exit(-1);` in that case?

Personally, I fully agree with Nick Pordash:

"I can tell you as a user that if any library I was using in my application called System.exit without my consent would result in a lot of frustration."

Also, do you have any examples of other products that do `System.exit(-1);` in case of trouble?

On Tue, 13/03/2018 at 19:07 -0400, Dmitriy Setrakyan wrote:

> On Tue, Mar 13, 2018 at 6:55 PM, Dmitry Pavlov <[hidden email]>
> wrote:
>
> > What do you think if stop is default for all cases?
> >
> > Kill is configurable.
> >
> > We can consider enforse sockets close for 'stop'. This will allow to ignore
> > hang node by rest of the cluster.
> >
>
> Dmitriy, I see that you cannot come to terms with stopping a process that
> was not started by Ignite. However, in majority of the deployments, users
> would prefer that you would "kill" the process instead of leaving it
> running in a "frozen" state. Frozen state is non-deterministic and it is
> impossible to create a recovery for it. Killing the process is very
> deterministic and can be recovered by restarting it in most cases.
>
> "stop" does not fix the problem we are trying to solve. The whole point is
> to prevent frozen state, and "stop" without "kill" does not prevent it. I
> am OK if "stop+kill" is the default behavior, which means that we try a
> graceful shutdown and then always kill the process anyway.
>
> I think we should have the following configurable options:
> - "stop+kill" (default)
> - "kill"
> - "stop"
> - "stop+restart" (if stop fails, we should kill regardless)


Re: IEP-14: Ignite failures handling (Discussion)

Vladimir Ozerov
In reply to this post by Valentin Kulichenko
Valya,

This is very easy to answer - if CommandLineStartup is used, then it is
standalone node. In all other cases it is embedded.

If node shutdown hangs - just let it continue hanging, so that application
admins are able to decide on their own what to do next. Someone would want
to get the stack trace, others would decide to restart outside of business
hours (e.g. because Ignite is used only in part of their application),
someone else would try to shutdown gracefully other components before
stopping the process to minimize negative impact, etc.

I do not quite understand why we are guessing here how embedded Ignite is used.
It could be used in any way and in any combination with other frameworks.
Process stop by default is simply not an option.

On Wed, Mar 14, 2018 at 3:12, Valentin Kulichenko <
[hidden email]>:

> Ivan,
>
> If grid hangs, graceful shutdown would most likely hang as well. Almost
> never you can recover from a bad state using graceful procedures.
>
> I agree that we should not create two defaults, especially in this case.
> It's not even strictly defined what is embedded node in Ignite. For
> example, if I start it using a custom main class and/or custom script
> instead of ignite.sh, would it be embedded or standalone node?
>
> -Val
>
> On Tue, Mar 13, 2018 at 4:58 PM, Ivan Rakov <[hidden email]> wrote:
>
> > One more note: "kill if standalone, stop if embedded" differs from what
> > you are suggesting "try graceful, then kill process regardless" only in
> > case when graceful shutdown hangs.
> > Do we have understanding, how often does graceful shutdown hang?
> > Obviously, *grid hang* is often case, but it shouldn't be messed with
> > *graceful shutdown hang*. From my experience, if something went wrong,
> > users just prefer to do kill -9  because it's much more reliable and
> easy.
> > Probably, in most of cases when kill -9 worked, graceful stop would have
> > worked as well - we just don't have such statistics.
> > It may be bad example, but: in our CI tests we intentionally break grid
> in
> > many harsh ways and perform a graceful stop after the test execution, and
> > it doesn't hang - otherwise we'd see many "Execution timeout" test suite
> > hangs.
> >
> > Best Regards,
> > Ivan Rakov
> >
> >
> > On 14.03.2018 2:24, Dmitriy Setrakyan wrote:
> >
> >> On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov <[hidden email]>
> >> wrote:
> >>
> >> I just would like to add my +1 for "kill if standalone, stop if
> embedded"
> >>> default option. My arguments:
> >>>
> >>> 1) Regarding "If Ignite hangs - it will likely be impossible to stop":
> >>> Unfortunately, it's true that Ignite can hang during stop procedure.
> >>> However, most of failures described under IEP-14 (storage IO
> exceptions,
> >>> death of critical system worker thread, etc) normally shouldn't turn
> node
> >>> into "impossible to stop" state. Turning into that state is a bug
> >>> itself. I
> >>> guess that we shouldn't choose system behavior on the basis of known
> >>> bugs.
> >>>
> >>
> >> The whole discussion is about protecting against force-major issues,
> >> including Ignite bugs. You are assuming that a user application will
> >> somehow continue to function if an Ignite node is stopped. In most cases
> >> it
> >> will just freeze itself and cause the rest of the application to hang.
> >>
> >> Again, "kill+stop" is the most deterministic and the safest default
> >> behavior. Try a graceful shutdown (which will make restart easier), and
> >> then kill the process regardless.
> >>
> >> Note that we are arguing about the default behavior. If a user does not
> >> like this default, then this user can change it to another behavior.
> >>
> >>
> >> 2) User might want to handle Ignite node crash before shutting down the
> >>> whole JVM - raise alert, close external resources, etc
> >>>
> >>> Very unlikely, but if a user is this advanced, then this user can
> change
> >> the default behavior. Most users will not even know how to configure
> such
> >> custom shutdown behavior and would prefer an automatic kill.
> >>
> >> 3) IEP-14 document has important notes: "More than one Ignite node could
> >> be
> >>
> >>> started in one JVM process" and "Different nodes in one JVM process
> could
> >>> belong to different clusters". This is possible only in embedded mode.
> I
> >>> think, we shouldn't shock user by sudden JVM halt (possibly, along with
> >>> another healthy nodes) if there's a chance of successful node stop.
> >>>
> >>> Has anyone actually seen a real example of that? I have not. This
> >> scenario
> >> is extremely unlikely and should not define the default behavior. Again,
> >> if
> >> a user is so advanced to come up with such a sophisticated deployment,
> >> then
> >> the same user should be able to set different default behaviors for
> >> different clusters.
> >>
> >>
> >
>

Re: IEP-14: Ignite failures handling (Discussion)

Vladimir Ozerov
As for shutdown, what we need to implement is a "hard shutdown" mode. This
is when we first close all network sockets, then cancel all registered
futures. This would be enough to unblock the cluster and local user threads.
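
To make the two steps concrete, here is a minimal sketch of such a "hard shutdown". It is only an illustration under assumptions: `HardShutdown`, `registerSocket`, `registerFuture` and `hardStop` are hypothetical names, not Ignite classes or methods, and a real implementation would have to hook into Ignite's own communication and future registries.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical registry of resources to release on hard shutdown (not an Ignite class). */
final class HardShutdown {
    private final List<Closeable> sockets = new CopyOnWriteArrayList<>();
    private final List<CompletableFuture<?>> futures = new CopyOnWriteArrayList<>();

    void registerSocket(Closeable socket) {
        sockets.add(socket);
    }

    void registerFuture(CompletableFuture<?> future) {
        futures.add(future);
    }

    /**
     * Step 1: close every socket so that remote nodes notice the failure.
     * Step 2: fail every pending future so that local callers stop waiting.
     */
    void hardStop() {
        for (Closeable socket : sockets) {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Best effort: keep going even if one socket fails to close.
            }
        }

        for (CompletableFuture<?> future : futures) {
            // Has no effect on futures that are already completed.
            future.completeExceptionally(new IllegalStateException("Node is shutting down"));
        }
    }
}
```

Sockets are closed first so that the rest of the cluster is unblocked even if failing one of the futures takes a long time.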

On Wed, Mar 14, 2018 at 8:40, Vladimir Ozerov <[hidden email]> wrote:

> Valya,
>
> This is very easy to answer - if CommandLineStartup is used, then it is
> standalone node. In all other cases it is embedded.
>
> If node shutdown hangs - just let it continue hanging, so that application
> admins are able to decide on their own what to do next. Someone would want
> to get the stack trace, others would decide to restart outside of business
> hours (e.g. because Ignite is used only in part of their application),
> someone else would try to shutdown gracefully other components before
> stopping the process to minimize negative impact, etc.
>
> I do not quite understand why we are guessing here how embedded Ignite is
> used. It could be used in any way and in any combination with other
> frameworks. Process stop by default is simply not an option.
>
> On Wed, Mar 14, 2018 at 3:12, Valentin Kulichenko <
> [hidden email]>:
>
>> Ivan,
>>
>> If grid hangs, graceful shutdown would most likely hang as well. Almost
>> never you can recover from a bad state using graceful procedures.
>>
>> I agree that we should not create two defaults, especially in this case.
>> It's not even strictly defined what is embedded node in Ignite. For
>> example, if I start it using a custom main class and/or custom script
>> instead of ignite.sh, would it be embedded or standalone node?
>>
>> -Val
>>
>> On Tue, Mar 13, 2018 at 4:58 PM, Ivan Rakov <[hidden email]>
>> wrote:
>>
>> > One more note: "kill if standalone, stop if embedded" differs from what
>> > you are suggesting "try graceful, then kill process regardless" only in
>> > case when graceful shutdown hangs.
>> > Do we have understanding, how often does graceful shutdown hang?
>> > Obviously, *grid hang* is often case, but it shouldn't be messed with
>> > *graceful shutdown hang*. From my experience, if something went wrong,
>> > users just prefer to do kill -9  because it's much more reliable and
>> easy.
>> > Probably, in most of cases when kill -9 worked, graceful stop would have
>> > worked as well - we just don't have such statistics.
>> > It may be bad example, but: in our CI tests we intentionally break grid
>> in
>> > many harsh ways and perform a graceful stop after the test execution,
>> and
>> > it doesn't hang - otherwise we'd see many "Execution timeout" test suite
>> > hangs.
>> >
>> > Best Regards,
>> > Ivan Rakov
>> >
>> >
>> > On 14.03.2018 2:24, Dmitriy Setrakyan wrote:
>> >
>> >> On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov <[hidden email]>
>> >> wrote:
>> >>
>> >> I just would like to add my +1 for "kill if standalone, stop if
>> embedded"
>> >>> default option. My arguments:
>> >>>
>> >>> 1) Regarding "If Ignite hangs - it will likely be impossible to stop":
>> >>> Unfortunately, it's true that Ignite can hang during stop procedure.
>> >>> However, most of failures described under IEP-14 (storage IO
>> exceptions,
>> >>> death of critical system worker thread, etc) normally shouldn't turn
>> node
>> >>> into "impossible to stop" state. Turning into that state is a bug
>> >>> itself. I
>> >>> guess that we shouldn't choose system behavior on the basis of known
>> >>> bugs.
>> >>>
>> >>
>> >> The whole discussion is about protecting against force-major issues,
>> >> including Ignite bugs. You are assuming that a user application will
>> >> somehow continue to function if an Ignite node is stopped. In most
>> cases
>> >> it
>> >> will just freeze itself and cause the rest of the application to hang.
>> >>
>> >> Again, "kill+stop" is the most deterministic and the safest default
>> >> behavior. Try a graceful shutdown (which will make restart easier), and
>> >> then kill the process regardless.
>> >>
>> >> Note that we are arguing about the default behavior. If a user does not
>> >> like this default, then this user can change it to another behavior.
>> >>
>> >>
>> >> 2) User might want to handle Ignite node crash before shutting down the
>> >>> whole JVM - raise alert, close external resources, etc
>> >>>
>> >>> Very unlikely, but if a user is this advanced, then this user can
>> change
>> >> the default behavior. Most users will not even know how to configure
>> such
>> >> custom shutdown behavior and would prefer an automatic kill.
>> >>
>> >> 3) IEP-14 document has important notes: "More than one Ignite node
>> could
>> >> be
>> >>
>> >>> started in one JVM process" and "Different nodes in one JVM process
>> could
>> >>> belong to different clusters". This is possible only in embedded
>> mode. I
>> >>> think, we shouldn't shock user by sudden JVM halt (possibly, along
>> with
>> >>> another healthy nodes) if there's a chance of successful node stop.
>> >>>
>> >>> Has anyone actually seen a real example of that? I have not. This
>> >> scenario
>> >> is extremely unlikely and should not define the default behavior.
>> Again,
>> >> if
>> >> a user is so advanced to come up with such a sophisticated deployment,
>> >> then
>> >> the same user should be able to set different default behaviors for
>> >> different clusters.
>> >>
>> >>
>> >
>>
>

Re: IEP-14: Ignite failures handling (Discussion)

Andrey Kornev
If I were the one responsible for running Ignite-based applications (be it embedded or standalone Ignite) in my company's datacenter, I'd prefer the application nodes simply make their current state readily available to external tools (via JMX, health checks, etc.) and leave the decision of when to die and when to continue to run up to me. The last thing I need in production is a too-clever application that decides to kill itself based on its local (perhaps confused) state.

Usually SRE teams build all sorts of technology-specific tools to monitor health of the applications and they like to be as much in control as possible when it comes to killing processes.

I guess what I'm saying is this: keep things simple. Do not over engineer. In real production environments the companies will most likely have this feature disabled (I know I would) and instead rely on their own tooling for handling failures.

Regards
Andrey

________________________________
From: Vladimir Ozerov <[hidden email]>
Sent: Tuesday, March 13, 2018 10:43 PM
To: [hidden email]
Subject: Re: IEP-14: Ignite failures handling (Discussion)

As far as shutdown, what we need to implement is “hard shutdown” mode. This
is when we first close all network sockets, then cancel all registered
futures. This would be enough to unblock the cluster and local user threads.

On Wed, Mar 14, 2018 at 8:40, Vladimir Ozerov <[hidden email]> wrote:

> Valya,
>
> This is very easy to answer - if CommandLineStartup is used, then it is
> standalone node. In all other cases it is embedded.
>
> If node shutdown hangs - just let it continue hanging, so that application
> admins are able to decide on their own what to do next. Someone would want
> to get the stack trace, others would decide to restart outside of business
> hours (e.g. because Ignite is used only in part of their application),
> someone else would try to shutdown gracefully other components before
> stopping the process to minimize negative impact, etc.
>
> I do not quite understand why we are guessing here how embedded Ignite is
> used. It could be used in any way and in any combination with other
> frameworks. Process stop by default is simply not an option.
>
> On Wed, Mar 14, 2018 at 3:12, Valentin Kulichenko <
> [hidden email]>:
>
>> Ivan,
>>
>> If grid hangs, graceful shutdown would most likely hang as well. Almost
>> never you can recover from a bad state using graceful procedures.
>>
>> I agree that we should not create two defaults, especially in this case.
>> It's not even strictly defined what is embedded node in Ignite. For
>> example, if I start it using a custom main class and/or custom script
>> instead of ignite.sh, would it be embedded or standalone node?
>>
>> -Val
>>
>> On Tue, Mar 13, 2018 at 4:58 PM, Ivan Rakov <[hidden email]>
>> wrote:
>>
>> > One more note: "kill if standalone, stop if embedded" differs from what
>> > you are suggesting "try graceful, then kill process regardless" only in
>> > case when graceful shutdown hangs.
>> > Do we have understanding, how often does graceful shutdown hang?
>> > Obviously, *grid hang* is often case, but it shouldn't be messed with
>> > *graceful shutdown hang*. From my experience, if something went wrong,
>> > users just prefer to do kill -9  because it's much more reliable and
>> easy.
>> > Probably, in most of cases when kill -9 worked, graceful stop would have
>> > worked as well - we just don't have such statistics.
>> > It may be bad example, but: in our CI tests we intentionally break grid
>> in
>> > many harsh ways and perform a graceful stop after the test execution,
>> and
>> > it doesn't hang - otherwise we'd see many "Execution timeout" test suite
>> > hangs.
>> >
>> > Best Regards,
>> > Ivan Rakov
>> >
>> >
>> > On 14.03.2018 2:24, Dmitriy Setrakyan wrote:
>> >
>> >> On Tue, Mar 13, 2018 at 7:13 PM, Ivan Rakov <[hidden email]>
>> >> wrote:
>> >>
>> >> I just would like to add my +1 for "kill if standalone, stop if
>> embedded"
>> >>> default option. My arguments:
>> >>>
>> >>> 1) Regarding "If Ignite hangs - it will likely be impossible to stop":
>> >>> Unfortunately, it's true that Ignite can hang during stop procedure.
>> >>> However, most of failures described under IEP-14 (storage IO
>> exceptions,
>> >>> death of critical system worker thread, etc) normally shouldn't turn
>> node
>> >>> into "impossible to stop" state. Turning into that state is a bug
>> >>> itself. I
>> >>> guess that we shouldn't choose system behavior on the basis of known
>> >>> bugs.
>> >>>
>> >>
>> >> The whole discussion is about protecting against force-major issues,
>> >> including Ignite bugs. You are assuming that a user application will
>> >> somehow continue to function if an Ignite node is stopped. In most
>> cases
>> >> it
>> >> will just freeze itself and cause the rest of the application to hang.
>> >>
>> >> Again, "kill+stop" is the most deterministic and the safest default
>> >> behavior. Try a graceful shutdown (which will make restart easier), and
>> >> then kill the process regardless.
>> >>
>> >> Note that we are arguing about the default behavior. If a user does not
>> >> like this default, then this user can change it to another behavior.
>> >>
>> >>
>> >> 2) User might want to handle Ignite node crash before shutting down the
>> >>> whole JVM - raise alert, close external resources, etc
>> >>>
>> >>> Very unlikely, but if a user is this advanced, then this user can
>> change
>> >> the default behavior. Most users will not even know how to configure
>> such
>> >> custom shutdown behavior and would prefer an automatic kill.
>> >>
>> >> 3) IEP-14 document has important notes: "More than one Ignite node
>> could
>> >> be
>> >>
>> >>> started in one JVM process" and "Different nodes in one JVM process
>> could
>> >>> belong to different clusters". This is possible only in embedded
>> mode. I
>> >>> think, we shouldn't shock user by sudden JVM halt (possibly, along
>> with
>> >>> another healthy nodes) if there's a chance of successful node stop.
>> >>>
>> >>> Has anyone actually seen a real example of that? I have not. This
>> >> scenario
>> >> is extremely unlikely and should not define the default behavior.
>> Again,
>> >> if
>> >> a user is so advanced to come up with such a sophisticated deployment,
>> >> then
>> >> the same user should be able to set different default behaviors for
>> >> different clusters.
>> >>
>> >>
>> >
>>
>

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
In reply to this post by npordash
On Tue, Mar 13, 2018 at 11:17 PM, Nick Pordash <[hidden email]>
wrote:

> I can tell you as a user that if any library I was using in my application
> called System.exit without my consent would result in a lot of frustration.
>
> If ignite enters an unrecoverable state then I think that is something that
> should be observable locally, similar to node segmentation and then the
> application can decide the best course of action.
>

Nick, you would be a lot more frustrated if Ignite was frozen and every
call to Ignite would freeze the application threads as well. Again, if you
prefer to keep the process around, even if Ignite freezes, then you can
always configure this behavior, but I still believe that the default should
be to kill the process.

Ignite is a horizontally scalable system, so killing of one node should not
be a significant event and should not disrupt the cluster. However, a
freeze of one node is a significant event and can bring the whole cluster
to a halt.

D.

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
In reply to this post by Andrey Kornev
On Wed, Mar 14, 2018 at 3:36 PM, Andrey Kornev <[hidden email]>
wrote:

> If I were the one responsible for running Ignite-based applications (be it
> embedded or standalone Ignite) in my company's datacenter, I'd prefer the
> application nodes simply make their current state readily available to
> external tools (via JMX, health checks, etc.) and leave the decision of
> when to die and when to continue to run up to me. The last thing I need in
> production is a too clever an application that decides to kill itself based
> on its local (perhaps confused) state.
>
> Usually SRE teams build all sorts of technology-specific tools to monitor
> health of the applications and they like to be as much in control as
> possible when it comes to killing processes.
>
> I guess what I'm saying is this: keep things simple. Do not over engineer.
> In real production environments the companies will most likely have this
> feature disabled (I know I would) and instead rely on their own tooling for
> handling failures.
>
>
Andrey, our priority should be to keep the cluster operational. If a frozen
Ignite node is kept around, the whole cluster becomes un-operational. I bet
this is not what you would prefer in production either. However, if we kill
the process, then the cluster should continue to operate.

We are talking about a distributed system in which a failure of one node
should not matter. If we want to keep this promise to the users, then we
must kill the process if Ignite node freezes.

Also, keep in mind that we are talking about the "default" behavior. If you
are not happy with the "default" mode, then you will be able to configure
other behaviors, like keeping the frozen Ignite node around, if you like.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Andrey Kornev
I'm not disagreeing with you, Dmitriy.

What I'm trying to say is that if we assume that a serious enough bug or some environmental issue prevents an Ignite node from functioning correctly, then it's only logical to assume that the Ignite process is completely hosed (for example, due to a very, very long STW pause) and can't make any progress at all. In a situation like this the application can't reason about the process state, and the process may not even be able to kill itself. The only reliable way to handle cases like that is to have an external observer (a health monitoring tool) that is not itself affected by the bug or the environmental issue and can either make a decision by itself or send a notification to the SRE team.
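
As a rough illustration of what such an external observer could check, here is a minimal sketch. It assumes a convention that nobody in this thread has proposed: the node periodically touches a heartbeat file, and a separate monitoring process treats the node as hung when the file gets too old. `HeartbeatCheck` and `isStale` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

/** Hypothetical check run by a monitoring process outside the node's JVM. */
final class HeartbeatCheck {
    private HeartbeatCheck() {
    }

    /**
     * Returns true if the heartbeat file is missing or was last modified
     * more than maxAge before the given instant.
     */
    static boolean isStale(Path heartbeatFile, Duration maxAge, Instant now) throws IOException {
        if (!Files.exists(heartbeatFile))
            return true;

        Instant lastBeat = Files.getLastModifiedTime(heartbeatFile).toInstant();

        return Duration.between(lastBeat, now).compareTo(maxAge) > 0;
    }
}
```

Because the check runs in another process, a long STW pause or a deadlock inside the node cannot stop it from firing.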

In my previous post I only suggested going easy on the "cleverness" of the self-monitoring implementation, as IMHO it won't be used much in production environments. I think Ignite as it is already provides sufficient means of monitoring its health (they may or may not be robust enough, which is a different issue).

Regards
Andrey

________________________________
From: Dmitriy Setrakyan <[hidden email]>
Sent: Wednesday, March 14, 2018 6:22 PM
To: [hidden email]
Subject: Re: IEP-14: Ignite failures handling (Discussion)

On Wed, Mar 14, 2018 at 3:36 PM, Andrey Kornev <[hidden email]>
wrote:

> If I were the one responsible for running Ignite-based applications (be it
> embedded or standalone Ignite) in my company's datacenter, I'd prefer the
> application nodes simply make their current state readily available to
> external tools (via JMX, health checks, etc.) and leave the decision of
> when to die and when to continue to run up to me. The last thing I need in
> production is a too clever an application that decides to kill itself based
> on its local (perhaps confused) state.
>
> Usually SRE teams build all sorts of technology-specific tools to monitor
> health of the applications and they like to be as much in control as
> possible when it comes to killing processes.
>
> I guess what I'm saying is this: keep things simple. Do not over engineer.
> In real production environments the companies will most likely have this
> feature disabled (I know I would) and instead rely on their own tooling for
> handling failures.
>
>
Andrey, our priority should be to keep the cluster operational. If a frozen
Ignite node is kept around, the whole cluster becomes un-operational. I bet
this is not what you would prefer in production either. However, if we kill
the process, then the cluster should continue to operate.

We are talking about a distributed system in which a failure of one node
should not matter. If we want to keep this promise to the users, then we
must kill the process if Ignite node freezes.

Also, keep in mind that we are talking about the "default" behavior. If you
are not happy with the "default" mode, then you will be able to configure
other behaviors, like keeping the frozen Ignite node around, if you like.

D.

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
On Wed, Mar 14, 2018 at 7:12 PM, Andrey Kornev <[hidden email]>
wrote:

> I'm not disagreeing with you, Dmitriy.
>
> What I'm trying to say is that if we assume that a serious enough bug or
> some environmental issue prevents Ignite node from functioning correctly,
> then it's only logical to assume that Ignite process is completely hosed
> (for example, due to a very very long STW pause) and can't make any
> progress at all. In a situation like this the application can't reason
> about the process state, and the process itself may not be able to even
> kill itself. The only reliable way to handle cases like that is to have an
> external observer (a health monitoring tool) that is not itself affected by
> the bug or the env issue and can either make a decision by itself or send a
> notification to the SRE team.
>

Agree about the external observers, but that is something a user should do
outside of Ignite.


> In my previous post I only suggest to go easy on the "cleverness" of the
> self-monitoring implementation as IMHO it won't be used much in production
> environment. I think Ignite as it is already provides sufficient means
> of monitoring its health (they may or may not be robust enough, which is a
> different issue).
>

The approach I am suggesting is pretty simple - "kill" the process in case
of a critical error. The only intelligence I would like to add is to
attempt shutting down the Ignite node gracefully before the "kill" is
executed. If a node is shut down gracefully, then the restart procedure will
be faster, so it is worthwhile to try.

Some of the critical errors include running out of disk or memory, losing
Ignite system threads, etc. These errors are truly unrecoverable from the
application standpoint and should mostly be handled with a process restart
anyway.

D.

Re: IEP-14: Ignite failures handling (Discussion)

Dmitriy Pavlov
Hi Dmitriy,

It seems that everyone here agrees that killing the process gives a more
guaranteed result. The issue is that the majority of the community does
not consider this acceptable when Ignite is started as an embedded
library (e.g. from Java, using Ignition.start()).

What can help to accept the community's opinion? Let's remember the Apache
principle: "community first".

If release 2.5 shows us that this was impractical, we will change the default
to kill even for the library case. What do you think?

Sincerely,
Dmitriy Pavlov

On Thu, Mar 15, 2018 at 5:48, Dmitriy Setrakyan <[hidden email]> wrote:

> On Wed, Mar 14, 2018 at 7:12 PM, Andrey Kornev <[hidden email]>
> wrote:
>
> > I'm not disagreeing with you, Dmitriy.
> >
> > What I'm trying to say is that if we assume that a serious enough bug or
> > some environmental issue prevents Ignite node from functioning correctly,
> > then it's only logical to assume that Ignite process is completely hosed
> > (for example, due to a very very long STW pause) and can't make any
> > progress at all. In a situation like this the application can't reason
> > about the process state, and the process itself may not be able to even
> > kill itself. The only reliable way to handle cases like that is to have
> an
> > external observer (a health monitoring tool) that is not itself affected
> by
> > the bug or the env issue and can either make a decision by itself or
> send a
> > notification to the SRE team.
> >
>
> Agree about the external observers, but that is something a user should do
> outside of Ignite.
>
>
> > In my previous post I only suggest to go easy on the "cleverness" of the
> > self-monitoring implementation as IMHO it won't be used much in
> production
> > environment. I think Ignite as it is already provides sufficient means
> > of monitoring its health (they may or may not be robust enough, which is
> a
> > different issue).
> >
>
> The approach I am suggesting is pretty simple - "kill" the process in case
> of a critical error. The only intelligence I would like to add is to
> attempt shutting down the Ignite node gracefully before the "kill" is
> executed. If a node is shut down gracefully, then the restart procedure will
> be faster, so it is worthwhile to try.
>
> Some of the critical errors include running out of disk or memory, losing
> Ignite system threads, etc. These errors are truly unrecoverable from the
> application standpoint and should mostly be handled with a process restart
> anyway.
>
> D.
>

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
On Thu, Mar 15, 2018 at 5:21 AM, Dmitry Pavlov <[hidden email]>
wrote:

> Hi Dmitriy,
>
> It seems, here everyone agrees that killing the process will give a more
> guaranteed result. The question is that the majority in the community does
> not consider this to be acceptable in case Ignite is started as embedded
> lib (e.g. from Java, using Ignition.start())
>
> What can help to accept the community's opinion? Let's remember Apache
> principle: "community first".
>

I am still confused about the problem the majority of the community is
trying to solve. If our priority is to keep the cluster in frozen state,
then what is the reason for this task altogether?

The priority should be to keep the cluster operational, not frozen. The
only solution here is "kill" or "stop+kill". If the community does not
accept this option as a default, then I propose to drop this task
altogether, because we do not have to do anything to keep the cluster
frozen.


> If release 2.5 will show us it was impractical, we will change default to
> kill even for library. What do you think?
>

See above. I do not see a reason to continue with this task if the end
result is identical to what we have today.

I want to give the community another chance to speak up and voice their
opinions again, having fully understood the context and the problem being
solved here.

D.

Re: IEP-14: Ignite failures handling (Discussion)

agura
Hi!

Thank you all for your opinions and ideas!

While reading the thread I came to two important conclusions:

1. The proposed API should be changed, because an enumeration of
possible actions is a bad idea. A cleaner and simpler design would let
the user provide a failure handler implementation with custom failure
handling logic if needed.

2. Several failure handler implementations should be provided out of
the box, so that the default behaviour can be changed easily through
configuration. The following implementations should be provided:

     - NoOpFailureHandler - Useful for tests and debugging.
     - RestartProcessFailureHandler - A specific implementation that
can be used only with ignite.(sh|bat).
     - StopNodeFailureHandler - Stops the Ignite node in case of a
critical error.
     - StopNodeOrHaltFailureHandler(boolean tryStop, long timeout) -
The default failure handler. It tries to stop the node if tryStop is
true. If the node can't be stopped, or tryStop is false, the JVM
process is terminated forcibly (Runtime.halt()). The default value of
tryStop is false. Of course, we should limit the node shutdown time in
order to prevent hangs.
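
To make the proposal above more concrete, here is a minimal,
self-contained sketch of how a pluggable handler with a time-limited
stop could look. It does not use the Ignite API: the FailureHandler and
Node interfaces, the onFailure signature, the millisecond timeout and
the exit status are all assumptions made for illustration only. Only
the handler names and the (tryStop, timeout) parameters come from the
list above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FailureHandlerSketch {
    /** Hypothetical handler contract: react to a critical failure of a node. */
    public interface FailureHandler {
        void onFailure(Node node, Throwable cause);
    }

    /** Hypothetical stand-in for a node and the process that hosts it. */
    public interface Node {
        void stop() throws Exception; // graceful stop; may hang on a frozen node
        void halt(int status);        // real code would call Runtime.getRuntime().halt(status)
    }

    /** Does nothing; meant for tests and debugging. */
    public static final class NoOpFailureHandler implements FailureHandler {
        @Override public void onFailure(Node node, Throwable cause) { }
    }

    /** Tries a time-limited stop when tryStop is true; halts if that fails or is disabled. */
    public static final class StopNodeOrHaltFailureHandler implements FailureHandler {
        private final boolean tryStop;
        private final long timeoutMs;

        public StopNodeOrHaltFailureHandler(boolean tryStop, long timeoutMs) {
            this.tryStop = tryStop;
            this.timeoutMs = timeoutMs;
        }

        @Override public void onFailure(Node node, Throwable cause) {
            if (tryStop && stopWithin(node, timeoutMs))
                return; // stopped cleanly, no need to kill the process
            node.halt(1); // exit status chosen arbitrarily for this sketch
        }

        /** Runs stop() on a separate thread and waits at most timeoutMs for it to finish. */
        private static boolean stopWithin(Node node, long timeoutMs) {
            CountDownLatch stopped = new CountDownLatch(1);
            Thread stopper = new Thread(() -> {
                try {
                    node.stop();
                    stopped.countDown();
                } catch (Exception e) {
                    // A failed stop is treated like a hang: the wait below times out
                    // and the caller falls back to halt.
                }
            }, "node-stopper");
            stopper.setDaemon(true); // a hung stop must not keep the JVM alive
            stopper.start();
            try {
                return stopped.await(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger halts = new AtomicInteger();
        // Simulated frozen node: stop() never returns within the timeout.
        Node frozen = new Node() {
            @Override public void stop() throws Exception { Thread.sleep(60_000); }
            @Override public void halt(int status) { halts.incrementAndGet(); }
        };
        new StopNodeOrHaltFailureHandler(true, 100).onFailure(frozen, new Error("critical"));
        System.out.println("halt calls: " + halts.get());
    }
}
```

The stop runs on a daemon thread so that a frozen node cannot block the
handler itself; once the timeout expires the handler falls back to the
forced termination, which is the "stop+kill" behaviour discussed
earlier in the thread.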

As for the default behavior, I agree with those who believe that the
most suitable default option is process termination (although I had a
different opinion before). The strongest argument for this choice is
that it is impossible to reason about the system state after a critical
error.
Also, I believe that we can't choose a solution that will suit every
community member, and the best we can do is provide a simple way of
changing this behavior.

So, I think the discussion of the default behavior should be finished.
I'll update IEP-14 [1] according to my conclusions above. If you have
any ideas or thoughts about these conclusions, please feel free to share.

Thanks!

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

On Fri, Mar 16, 2018 at 1:07 AM, Dmitriy Setrakyan
<[hidden email]> wrote:

> On Thu, Mar 15, 2018 at 5:21 AM, Dmitry Pavlov <[hidden email]>
> wrote:
>
>> Hi Dmitriy,
>>
>> It seems everyone here agrees that killing the process will give a more
>> reliable result. The issue is that the majority of the community does
>> not consider this acceptable when Ignite is started as an embedded
>> lib (e.g. from Java, using Ignition.start()).
>>
>> What can help to accept the community's opinion? Let's remember Apache
>> principle: "community first".
>>
>
> I am still confused about the problem the majority of the community is
> trying to solve. If our priority is to keep the cluster in a frozen state,
> then what is the reason for this task altogether?
>
> The priority should be to keep the cluster operational, not frozen. The
> only solution here is "kill" or "stop+kill". If the community does not
> accept this option as a default, then I propose to drop this task
> altogether, because we do not have to do anything to keep the cluster
> frozen.
>
>
>> If release 2.5 shows us that it was impractical, we will change the
>> default to kill even for the library. What do you think?
>>
>
> See above. I do not see a reason to continue with this task if the end
> result is identical to what we have today.
>
> I want to give the community another chance to speak up and voice their
> opinions again, having fully understood the context and the problem being
> solved here.
>
> D.

Re: IEP-14: Ignite failures handling (Discussion)

dsetrakyan
Thanks Andrey! I have added a few comments to the IEP-14 page.

D.

On Fri, Mar 16, 2018 at 6:44 AM, Andrey Gura <[hidden email]> wrote:

> Hi!
>
> Thank you all for your opinions and ideas!
>
> While reading the thread I came to two important conclusions:
>
> 1. The proposed API should be changed, because an enumeration of
> possible actions is a bad idea. A cleaner and simpler design would let
> the user provide a failure handler implementation with custom failure
> handling logic if needed.
>
> 2. Several failure handler implementations should be provided out of
> the box, so that the default behaviour can be changed easily through
> configuration. The following implementations should be provided:
>
>      - NoOpFailureHandler - Useful for tests and debugging.
>      - RestartProcessFailureHandler - A specific implementation that
> can be used only with ignite.(sh|bat).
>      - StopNodeFailureHandler - Stops the Ignite node in case of a
> critical error.
>      - StopNodeOrHaltFailureHandler(boolean tryStop, long timeout) -
> The default failure handler. It tries to stop the node if tryStop is
> true. If the node can't be stopped, or tryStop is false, the JVM
> process is terminated forcibly (Runtime.halt()). The default value of
> tryStop is false. Of course, we should limit the node shutdown time in
> order to prevent hangs.
>
> As for the default behavior, I agree with those who believe that the
> most suitable default option is process termination (although I had a
> different opinion before). The strongest argument for this choice is
> that it is impossible to reason about the system state after a critical
> error.
> Also, I believe that we can't choose a solution that will suit every
> community member, and the best we can do is provide a simple way of
> changing this behavior.
>
> So, I think the discussion of the default behavior should be finished.
> I'll update IEP-14 [1] according to my conclusions above. If you have
> any ideas or thoughts about these conclusions, please feel free to share.
>
> Thanks!
>
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling