GridDhtInvalidPartitionException takes the cluster down

GridDhtInvalidPartitionException takes the cluster down

Roman Shtykh
Igniters,

Restarting a node while injecting data that then expires results in a
GridDhtInvalidPartitionException, which terminates nodes with SYSTEM_WORKER_TERMINATION one by one, taking the whole cluster down. This is really bad, and I couldn't find a way to save the cluster from disappearing.
I created a JIRA issue https://issues.apache.org/jira/browse/IGNITE-11620 with a test case. Any clues on how to fix this inconsistency during rebalancing?
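
For the record, here is a minimal sketch of the scenario (illustrative only: the cache name, TTL value, and loop bounds are mine, not taken from the IGNITE-11620 test case):

import java.util.concurrent.TimeUnit;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class ExpiryRestartRepro {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Cache whose entries expire shortly after creation.
        CacheConfiguration<Integer, byte[]> ccfg =
            new CacheConfiguration<Integer, byte[]>("ttl-cache")
                .setExpiryPolicyFactory(
                    CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.SECONDS, 1)))
                .setEagerTtl(true); // Expired entries are purged by GridCacheTtlManager.

        IgniteCache<Integer, byte[]> cache = ignite.getOrCreateCache(ccfg);

        // Keep injecting data; restarting another server node while this loop
        // runs is what triggers the GridDhtInvalidPartitionException.
        for (int i = 0; i < 10_000_000; i++)
            cache.put(i % 100_000, new byte[128]);
    }
}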

-- Roman

Re: GridDhtInvalidPartitionException takes the cluster down

Pavel Kovalenko
Hi Roman,

I think this InvalidPartition case can simply be handled
in the GridCacheTtlManager.expire method.
As a workaround, a custom FailureHandler can be configured that will not
stop a node when such an exception is thrown.
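
A minimal sketch of such a workaround handler (my own illustration; the class name is made up, and the exception is matched by its simple name because it lives in an internal, version-specific package):

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.StopNodeFailureHandler;

public class IgnoreInvalidPartitionFailureHandler implements FailureHandler {
    private final FailureHandler dflt = new StopNodeFailureHandler();

    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        // Walk the cause chain and keep the node alive for this one exception.
        for (Throwable t = failureCtx.error(); t != null; t = t.getCause()) {
            if ("GridDhtInvalidPartitionException".equals(t.getClass().getSimpleName())) {
                ignite.log().warning("Ignoring InvalidPartition failure: " + t);
                return false; // Node is not invalidated or stopped.
            }
        }
        return dflt.onFailure(ignite, failureCtx); // Default behavior otherwise.
    }
}

It would be plugged in via IgniteConfiguration.setFailureHandler(new IgnoreInvalidPartitionFailureHandler()).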


Re: GridDhtInvalidPartitionException takes the cluster down

Nikolay Izhikov-2
Guys.

We should fix SYSTEM_WORKER_TERMINATION once and for all.
It seems we have had ten or more "cluster shutdown" bugs in this subsystem
since it was introduced.

Should we disable it by default in 2.7.5?



Re: GridDhtInvalidPartitionException takes the cluster down

Roman Shtykh
If disabling it restores the behavior we had before introducing failure handlers, I think it's better to have it disabled than to kill the whole cluster, as happened in my case, and to create a parent issue for those ten bugs.

Pavel, thanks for the suggestion!

 


Re: GridDhtInvalidPartitionException takes the cluster down

agura
Failure handlers were introduced in order to avoid cluster hangs; they
kill nodes instead.

If a critical worker was terminated by GridDhtInvalidPartitionException,
then your node is unable to work anymore.

An unexpected cluster shutdown, with the reasons failure handlers write
to the logs, is better than hanging. So the answer is NO. We mustn't
disable failure handlers.


Re: GridDhtInvalidPartitionException takes the cluster down

dmagda
Nikolay,

Thanks for kicking off this discussion. Surprisingly, I planned to start a
similar one today and incidentally came across this thread.

I agree that the failure handler should be off by default, or the default
settings have to be revisited. It's true that people are complaining about
node shutdowns even under moderate workloads. For instance, here is the most
recent feedback, related to slow checkpointing:
https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure

At a minimum, let's consider the following:

   - A failure handler needs to provide hints on how to avoid the shutdown
   in the future. Take the checkpointing SO thread above. It's unclear from
   the logs how to prevent the same situation next time (suggest parameters
   for tuning, flash drives, etc.).
   - Is there any protection against a full cluster restart? We need to
   distinguish a slow cluster from a stuck one. A node removal should not
   lead to a meltdown of the whole storage.
   - Should we enable the failure handler for things like transactions or
   PME and have it off for checkpointing and the like? Let's have it enabled
   for cases where we are 100% certain that a node shutdown is the right
   thing, and print out warnings with suggestions whenever we're not
   confident that the removal is appropriate.

--
Denis



Re: GridDhtInvalidPartitionException takes the cluster down

Roman Shtykh
+1 for having the default settings revisited.
I understand Andrey's reasoning, but sometimes taking nodes down is too radical (in my case it was a GridDhtInvalidPartitionException, which could perhaps be ignored for a while during rebalancing; I might be wrong here).

-- Roman
 


Re: GridDhtInvalidPartitionException takes the cluster down

Andrey Kuznetsov
By default, the SYSTEM_WORKER_BLOCKED failure type is not handled. I don't
like this behavior, but it may be useful sometimes: "frozen" threads have a
chance to become active again after the load decreases. As for
SYSTEM_WORKER_TERMINATION, it's unrecoverable; there is no point waiting for
a dead thread's magical resurrection. And if, under some circumstances, a
node stop leads to a cascading cluster crash, then that's a bug, and it
should be fixed. Once and for all. Instead of hiding the flaw we have in
the product.
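
For reference, a sketch of how the set of ignored failure types can be tuned on a handler (based on the 2.7-era API as I recall it; the exact defaults may differ between versions):

import java.util.Collections;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlingConfig {
    public static IgniteConfiguration cfg() {
        // StopNodeOrHaltFailureHandler is the usual default handler.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();

        // SYSTEM_WORKER_BLOCKED is in the ignored set by default; passing an
        // empty set here would make blocked workers fatal as well.
        hnd.setIgnoredFailureTypes(
            Collections.singleton(FailureType.SYSTEM_WORKER_BLOCKED));

        return new IgniteConfiguration().setFailureHandler(hnd);
    }
}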

--
Best regards,
  Andrey Kuznetsov.

Re: GridDhtInvalidPartitionException takes the cluster down

Nikolay Izhikov-2
Andrey.

> As for SYSTEM_WORKER_TERMINATION, it's unrecoverable; there is no point waiting for a dead thread's magical resurrection.

Why is it unrecoverable?
Why can't we restart such a thread?
Is there some kind of fundamental limitation that prevents restarting a system thread?

Actually, distributed systems are designed to overcome failures (a thread failure or a node failure, for example), aren't they?

> if, under some circumstances, a node stop leads to a cascading cluster crash, then that's a bug

How can a user know it's a bug? Where should this bug be reported?
Do we log it somewhere?
Do we warn the user, once or several times, before shutdown?

Right now, this feature is killing the user experience.

If I were a user of a product that just shut down with a poor log, I would throw that product away.
Do we want that for Ignite?

From the SO discussion I see the following error message: ">>> Possible starvation in striped pool."
Are you sure this message is clear to an Ignite user (not an Ignite hacker)?
What should a user do to prevent this error in the future?


Re: GridDhtInvalidPartitionException takes the cluster down

Andrey Kuznetsov
Nikolay,

> Why can't we restart such a thread?
Technically, we can. It's just a matter of design: the thread could be made
non-critical, and we could restart it every time it dies. But such a design
looks poor to me. It's much simpler to catch and handle all exceptions in
critical threads. Failure handling is a last-chance tool that reveals
internal Ignite errors. It's not pleasant for us when users see these
errors, but it's better than hiding them.

> Actually, distributed systems are designed to overcome failures (a thread
failure or a node failure, for example), aren't they?
100% agree with you: overcome, but not hide.

> How can a user know it's a bug? Where should this bug be reported?
As far as I can see from user-list messages, our users are qualified enough
to provide the necessary information from their cluster-wide logs.
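
To make the design under discussion concrete, here is a purely hypothetical restart-on-death wrapper (this is not how Ignite wires its critical workers; it only illustrates the alternative I find poor):

// Hypothetical sketch: respawn a worker thread whenever it dies,
// instead of escalating to a failure handler.
public final class SelfRestartingWorker {
    public static void start(String name, Runnable body) {
        Thread t = new Thread(() -> {
            try {
                body.run();
            }
            catch (Throwable e) {
                // Log and respawn rather than invalidating the node.
                System.err.println("Worker '" + name + "' died, restarting: " + e);
                start(name, body);
            }
        }, name);
        t.start();
    }
}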


--
Best regards,
  Andrey Kuznetsov.

Re: GridDhtInvalidPartitionException takes the cluster down

Nikolay Izhikov-2
Andrey.

> the thread could be made non-critical, and we could restart it every time
it dies

Why can't we restart a critical thread?
What is the fundamental difference between critical and non-critical threads?

> It's much simpler to catch and handle all exceptions in critical threads

I don't agree with you.
We don't develop Ignite because it is simple!
We must spend the extra time to make it robust and resilient to failures.

> Failure handling is a last-chance tool that reveals internal Ignite errors
> 100% agree with you: overcome, but not hide.

Logging a stack trace with a proper explanation is not hiding.
Killing nodes, and the whole cluster, is not "handling".

> As far as I can see from user-list messages, our users are qualified enough
to provide the necessary information from their cluster-wide logs.

We shouldn't develop our product only for users who are able to read the
Ignite sources to decipher the failure reason behind "starvation in striped
pool".

Some of my questions remain unanswered :) :

1. How can a user know it's an Ignite bug? Where should this bug be reported?
2. Do we log it somewhere?
3. Do we warn the user several times before shutdown?
4. "Starvation in striped pool" is not a clear error message, I think.
Let's make it more specific!
5. Let's write to the user's log what he or she should do to prevent this
error in the future.



Re: GridDhtInvalidPartitionException takes the cluster down

Andrey Kuznetsov
Nikolay,

Feel free to suggest better error messages to indicate internal/critical
failures. User actions in response to critical failures are rather limited:
mail the user list or maybe file an issue. As for repetitive warnings, that
makes sense, but it requires additional machinery to deliver such signals;
merely spamming the log will not have an effect.

Anyway, when experienced committers suggest disabling failure handling and
hiding existing issues, I feel as if they are pulling my leg.

Best regards,
Andrey Kuznetsov.


Re: GridDhtInvalidPartitionException takes the cluster down

Roman Shtykh
I do believe failure handling is useful, but it has to be revisited (including the above-mentioned suggestions), because what we have now is not what Ignite promises to do. Disabling it can be a temporary measure until it is improved.

Andrey, when you say "hiding", I kind of understand you (even if I don't think we are hiding), but with the current behavior it's like running stress tests on users' clusters: any serious situation/bug can crash the cluster and, in turn, trust in Ignite.
I think this discussion reveals another problem: we might need something like Jepsen tests, which would hopefully help us find such issues. AFAIK, CockroachDB has had them running for a couple of years.

-- Roman
 

    On Tuesday, March 26, 2019, 8:24:24 p.m. GMT+9, Andrey Kuznetsov <[hidden email]> wrote:  
 
 Nikolay,

Feel free to suggest better error messages to indicate internal/critical
failures. User actions in response to critical failures are rather limited:
mail to user-list or maybe file an issue. As for repetitive warnings, it
makes sense, but requires additional stuff to deliver such signals, mere
spamming to log will not have an effect.

Anyway, when experienced committers suggest to disable failure handling and
hide existing issues, I feel as if they are pulling my leg.

Best regards,
Andrey Kuznetsov.

вт, 26 марта 2019, 13:30 Nikolay Izhikov [hidden email]:

> Andrey.
>
> >  the thread can be made non-critical, and we can restart it every time it
> dies
>
> Why we can't restart critical thread?
> What is the root difference between critical and non critical threads?
>
> > It's much simpler to catch and handle all exceptions in critical threads
>
> I don't agree with you.
> We develop Ignite not because it simple!
> We must spend extra time to made it robust and resilient to the failures.
>
> > Failure handling is a last-chance tool that reveals internal Ignite
> errors
> > 100% agree with you: overcome, but not hide.
>
> Logging stack trace with proper explanation is not hiding.
> Killing nodes and whole cluster is not "handling".
>
> > As far as I see from user-list messages, our users are qualified enough
> to provide necessary information from their cluster-wide logs.
>
> We shouldn't develop our product only for users who are able to read Ignite
> sources to decrypt the fail reason behind "starvation in stripped pool"
>
> Some of my questions remain unanswered :) :
>
> 1. How user can know it's an Ignite bug? Where this bug should be reported?
> 2. Do we log it somewhere?
> 3. Do we warn user before shutdown several times?
> 4. "starvation in stripped pool" I think it's not clear error message.
> Let's make it more specific!
> 5. Let's write to the user log - what he or she should do to prevent this
> error in future?
>
>
Reply | Threaded
Open this post in threaded view
|

Re: GridDhtInvalidPartitionException takes the cluster down

daradurvs
In reply to this post by Andrey Kuznetsov
In general I agree with Andrey: the handler is very useful in itself. It
lets us find out that ‘GridDhtInvalidPartitionException’ is not processed
properly by the worker in the PME process.

Nikolay, look at the code: if the failure handler handles an exception, it
means the while-true loop in the worker's body was interrupted by an
unexpected exception and the thread has completed its lifecycle.

Without the failure handler, in the current case, the cluster would hang,
because the node could no longer participate in the PME process.

So the problem is the incorrect handling of the exception in the PME task,
and that is what should be fixed.

--
Best Regards, Vyacheslav D.
Reply | Threaded
Open this post in threaded view
|

Re: GridDhtInvalidPartitionException takes the cluster down

Roman Shtykh
Vyacheslav, if you are talking about the particular case I described, I believe it has no influence on PME. What could happen is the CleanupWorker thread dying (which is not good either). But I believe we are talking in a wider scope.

-- Roman
 

Reply | Threaded
Open this post in threaded view
|

Re: GridDhtInvalidPartitionException takes the cluster down

agura
Igniters,

1. First of all, I want to remind you why failure handlers were
implemented. Please take a look at IEP-14 [1] and the corresponding
discussion on the dev list [2] (quite an emotional discussion). These
sources also answer some of the questions raised in earlier posts in this
topic.

2. Note that the following failure types are ignored by default (BUT
these fixes ARE NOT included in 2.7):
- SYSTEM_WORKER_BLOCKED: a critical thread being unresponsive for a long
time is a problem, but we don't know why it happened (possibly a slow
environment), so we just ignore this failure.
- SYSTEM_CRITICAL_OPERATION_TIMEOUT: at the moment it only concerns
checkpoint read lock acquisition.

So we already have more or less adequate defaults.
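
For illustration, a minimal sketch of spelling these defaults out in
configuration. It assumes a version where
AbstractFailureHandler.setIgnoredFailureTypes is available (as noted
above, these fixes are not included in 2.7):

// A minimal configuration sketch of the defaults described above;
// assumes a version where AbstractFailureHandler.setIgnoredFailureTypes
// is available (not included in 2.7).
import java.util.EnumSet;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class DefaultIgnoredFailureTypes {
    public static void main(String[] args) {
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();

        // Don't stop the node on blocked workers or critical operation
        // timeouts; still stop it on SYSTEM_WORKER_TERMINATION and the rest.
        hnd.setIgnoredFailureTypes(EnumSet.of(
            FailureType.SYSTEM_WORKER_BLOCKED,
            FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

        Ignition.start(new IgniteConfiguration().setFailureHandler(hnd));
    }
}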

3. About the SYSTEM_WORKER_TERMINATION failure type.

Restarting the thread is a very bad idea, because the system is already in
an undefined state and its behavior is unpredictable from this point on.

For example, the discovery thread is a critical part of the discovery
protocol. If the discovery thread on some node is terminated while
processing a discovery message, then:
- The protocol is already broken, because the message will not be sent to
the next node in the ring; we can't ignore this failure, because the
whole cluster will suffer in this case.
- We could restart the thread and even try to process the same message
again. And then what? With high probability the same error will happen
and the discovery thread will be terminated again (see the sketch below).
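
Purely as an illustration of that retry argument (this is not Ignite
code): when the failure is deterministic, restarting the worker and
feeding it the same message just reproduces the same error.

// Illustrative only, not Ignite code: a deterministic failure defeats
// any restart-and-retry scheme, because the same input hits the same
// failing code path every time.
public class DeterministicRetrySketch {
    public static void main(String[] args) {
        String msg = "discovery-message"; // hypothetical stuck message

        for (int restart = 1; restart <= 3; restart++) {
            try {
                process(msg);

                return; // never reached for a deterministic failure
            }
            catch (RuntimeException e) {
                // The "restarted" worker hits the same code path with
                // the same input and dies again.
                System.err.println("Restart " + restart + " failed: " + e.getMessage());
            }
        }
    }

    private static void process(String msg) {
        throw new RuntimeException("deterministic failure while processing " + msg);
    }
}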

4. About enabling the failure handler for things like transactions or
PME and having it off for checkpointing and so on.

The failure handler is a general component. It isn't tied to a particular
kind of functionality (e.g. tx, PME or checkpointing). We can only manage
the behavior of the configured failure handler per failure type. See p.2
above.
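
For completeness, here is a sketch of the custom-handler workaround
suggested earlier in the thread: ignore failures caused by
GridDhtInvalidPartitionException and delegate everything else to a stock
handler. The root-cause check by class name is an illustrative choice:

// Sketch of the custom-handler workaround suggested earlier in the
// thread: ignore failures whose cause chain contains
// GridDhtInvalidPartitionException, delegate everything else to a
// stock handler. The class-name check is an illustrative choice.
import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.failure.StopNodeFailureHandler;

public class IgnoreInvalidPartitionFailureHandler implements FailureHandler {
    private final FailureHandler delegate = new StopNodeFailureHandler();

    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        for (Throwable t = failureCtx.error(); t != null; t = t.getCause()) {
            if ("GridDhtInvalidPartitionException".equals(t.getClass().getSimpleName()))
                return false; // don't invalidate (stop) the node
        }

        return delegate.onFailure(ignite, failureCtx);
    }
}

Such a handler is registered via IgniteConfiguration.setFailureHandler(...).
The caveat from p.3 still applies: ignoring a terminated worker leaves the
node running without that worker.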

5. About providing hints on how to work around the shutdown in the future.

I really don't like analogies, but I believe one is appropriate here:
what kind of hint can the JVM provide in case of an AssertionError? The
same holds for the failure handler. The failure handler is the last
resort, and the only thing the handler can provide is some information
about the failure. In our case this information contains the failure
context, the thread name and a thread dump.

6. About protection against a full cluster restart.

The failure handler is a node-local entity. If the whole cluster is
restarted/stopped due to some failure, it means only one thing: a
critical failure happened on every cluster node. So we can't protect the
cluster from shutting down within the current failure model.
A more complex failure model could be implemented that would require a
decision about stopping a node from all cluster nodes (or some subset, a
quorum). But that requires additional research and discussion.

7. About user experience.

Yes, the "starvation in striped pool" message isn't clear enough for...
hmmm... a user. But it is definitely clear to a developer. And I have no
idea what a clear message for the user would be. So... Do you have an
idea? You are welcome!
It is easy to say that something is wrong, but it is hard to make it right.

Also, I believe the user experience will not be better with a frozen
cluster instead of a failed cluster. And the user will not be happier if
we log more messages like "cluster will be stopped". And unfortunately we
can't explain to users what they should do to prevent this error in the
future, because we ourselves don't know what to do in this case. Every
failure is actually a bug that should be investigated and fixed. Fewer
bugs are what can improve the user experience.


Links:

1. https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
2. http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

Reply | Threaded
Open this post in threaded view
|

Re: GridDhtInvalidPartitionException takes the cluster down

agura
In reply to this post by Roman Shtykh
CleanupWorker termination can lead to the following effects:

- Queries can retrieve data that should have expired, so the application
will behave incorrectly.
- Memory and/or disk can overflow because entries aren't expired.
- Performance degradation is possible due to unmanageable data-set growth.
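
For context, here is a minimal sketch of the kind of cache whose expired
entries the CleanupWorker evicts. The cache name and TTL are illustrative:

// Context sketch: the CleanupWorker is the TTL cleanup thread that
// evicts expired entries, e.g. from a cache configured like this
// (cache name and TTL value are illustrative).
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class ExpiringCacheSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Integer, String> ccfg =
                new CacheConfiguration<Integer, String>("expiring")
                    .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.ONE_MINUTE))
                    .setEagerTtl(true); // expired entries are removed proactively

            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(ccfg);

            cache.put(1, "expires in one minute");
        }
    }
}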

Reply | Threaded
Open this post in threaded view
|

Re: GridDhtInvalidPartitionException takes the cluster down

dmagda
Folks, thanks for sharing details and inputs. This is helpful. Since I
spend a lot of time working with Ignite users, I'll look into this topic in
a couple of days and propose some changes. In the meantime, here is a fresh
report on the user list:
http://apache-ignite-users.70518.x6.nabble.com/Triggering-Rebalancing-Programmatically-get-error-while-requesting-td27651.html


-
Denis


On Tue, Mar 26, 2019 at 9:04 AM Andrey Gura <[hidden email]> wrote:

> CleanupWorker termination can lead to the following effects:
>
> - Queries can retrieve data that should have expired, so the application
> will behave incorrectly.
> - Memory and/or disk can overflow because entries weren't expired.
> - Performance degradation is possible due to unmanageable data set growth.
>
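
For reference, the custom-handler workaround mentioned earlier in the
thread could look roughly like the sketch below. It assumes the 2.7-era
failure-handling API; the class name and the cause-matching loop are
illustrative, and GridDhtInvalidPartitionException is an internal class
whose package may differ between versions.

import org.apache.ignite.Ignite;
import org.apache.ignite.failure.FailureContext;
import org.apache.ignite.failure.FailureHandler;
import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtInvalidPartitionException;

/**
 * Keeps the node alive when a critical worker dies with
 * GridDhtInvalidPartitionException; any other critical failure still
 * invalidates the node. Trade-off: the dead worker (e.g. CleanupWorker)
 * is not restarted, so the effects listed above still apply.
 */
public class IgnoreInvalidPartitionFailureHandler implements FailureHandler {
    @Override public boolean onFailure(Ignite ignite, FailureContext failureCtx) {
        // Walk the cause chain looking for the invalid-partition exception.
        for (Throwable t = failureCtx.error(); t != null; t = t.getCause()) {
            if (t instanceof GridDhtInvalidPartitionException) {
                ignite.log().warning("Ignoring critical failure caused by " +
                    "GridDhtInvalidPartitionException; TTL cleanup may be " +
                    "degraded on this node.", failureCtx.error());

                return false; // Do not invalidate (stop) the node.
            }
        }

        return true; // Invalidate the node for all other critical failures.
    }
}

Registered via IgniteConfiguration.setFailureHandler(new
IgnoreInvalidPartitionFailureHandler()), this keeps the node up at the
cost of the degraded-expiration effects Andrey lists above.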
> On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh <[hidden email]>
> wrote:
> >
> > Vyacheslav, if you are talking about this particular case I described, I
> believe it has no influence on PME. What could happen is having
> CleanupWorker thread dead (which is not good too).But I believe we are
> talking in a wider scope.
> >
> > -- Roman
> >
> >
> >     On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur <
> [hidden email]> wrote:
> >
> >  In general I agree with Andrey, the handler is very usefull itself. It
> > allows us to become know that ‘GridDhtInvalidPartitionException’ is not
> > processed properly in PME process by worker.
> >
> > Nikolay, look at the code, if Failure Handler hadles an exception - this
> > means that while-true loop in worker’s body has been interrupted with
> > unexpected exception and thread is completed his lifecycle.
> >
> > Without Failure Hanller, in the current case, the cluster will hang,
> > because of unable to participate in PME process.
> >
> > So, the problem is the incorrect handling of the exception in PME’s task
> > wich should be fixed.
> >
> >
> > вт, 26 марта 2019 г. в 14:24, Andrey Kuznetsov <[hidden email]>:
> >
> > > Nikolay,
> > >
> > > Feel free to suggest better error messages to indicate
> internal/critical
> > > failures. User actions in response to critical failures are rather
> limited:
> > > mail to user-list or maybe file an issue. As for repetitive warnings,
> it
> > > makes sense, but requires additional stuff to deliver such signals,
> mere
> > > spamming to log will not have an effect.
> > >
> > > Anyway, when experienced committers suggest to disable failure
> handling and
> > > hide existing issues, I feel as if they are pulling my leg.
> > >
> > > Best regards,
> > > Andrey Kuznetsov.
> > >
> > > вт, 26 марта 2019, 13:30 Nikolay Izhikov [hidden email]:
> > >
> > > > Andrey.
> > > >
> > > > >  the thread can be made non-critical, and we can restart it every
> time
> > > it
> > > > dies
> > > >
> > > > Why we can't restart critical thread?
> > > > What is the root difference between critical and non critical
> threads?
> > > >
> > > > > It's much simpler to catch and handle all exceptions in critical
> > > threads
> > > >
> > > > I don't agree with you.
> > > > We develop Ignite not because it simple!
> > > > We must spend extra time to made it robust and resilient to the
> failures.
> > > >
> > > > > Failure handling is a last-chance tool that reveals internal Ignite
> > > > errors
> > > > > 100% agree with you: overcome, but not hide.
> > > >
> > > > Logging stack trace with proper explanation is not hiding.
> > > > Killing nodes and whole cluster is not "handling".
> > > >
> > > > > As far as I see from user-list messages, our users are qualified
> enough
> > > > to provide necessary information from their cluster-wide logs.
> > > >
> > > > We shouldn't develop our product only for users who are able to read
> > > Ignite
> > > > sources to decrypt the fail reason behind "starvation in stripped
> pool"
> > > >
> > > > Some of my questions remain unanswered :) :
> > > >
> > > > 1. How user can know it's an Ignite bug? Where this bug should be
> > > reported?
> > > > 2. Do we log it somewhere?
> > > > 3. Do we warn user before shutdown several times?
> > > > 4. "starvation in stripped pool" I think it's not clear error
> message.
> > > > Let's make it more specific!
> > > > 5. Let's write to the user log - what he or she should do to prevent
> this
> > > > error in future?
> > > >
> > > >
> > > > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov <[hidden email]>:
> > > >
> > > > > Nikolay,
> > > > >
> > > > > >  Why we can't restart some thread?
> > > > > Technically, we can. It's just matter of design: the thread can be
> made
> > > > > non-critical, and we can restart it every time it dies. But such
> design
> > > > > looks poor to me. It's much simpler to catch and handle all
> exceptions
> > > in
> > > > > critical threads. Failure handling is a last-chance tool that
> reveals
> > > > > internal Ignite errors. It's not pleasant for us when users see
> these
> > > > > errors, but it's better than hiding.
> > > > >
> > > > > >  Actually, distributed systems are designed to overcome some
> bugs,
> > > > thread
> > > > > failure, node failure, for example, isn't it?
> > > > > 100% agree with you: overcome, but not hide.
> > > > >
> > > > > >  How user can know it's a bug? Where this bug should be reported?
> > > > > As far as I see from user-list messages, our users are qualified
> enough
> > > > to
> > > > > provide necessary information from their cluster-wide logs.
> > > > >
> > > > >
> > > > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov <[hidden email]
> >:
> > > > >
> > > > > > Andrey.
> > > > > >
> > > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is
> no
> > > use
> > > > > to
> > > > > > wait for dead thread's magical resurrection.
> > > > > >
> > > > > > Why is it unrecoverable?
> > > > > > Why we can't restart some thread?
> > > > > > Is there some kind of nature limitation to not restart system
> thread?
> > > > > >
> > > > > > Actually, distributed systems are designed to overcome some bugs,
> > > > thread
> > > > > > failure, node failure, for example, isn't it?
> > > > > > > if under some circumstances node> stop leads to cascade cluster
> > > > crash,
> > > > > > then it's a bug
> > > > > >
> > > > > > How user can know it's a bug? Where this bug should be reported?
> > > > > > Do we log it somewhere?
> > > > > > Do we warn user before shutdown one or several times?
> > > > > >
> > > > > > This feature kills user experience literally now.
> > > > > >
> > > > > > If I would be a user of the product that just shutdown with poor
> log
> > > I
> > > > > > would throw this product away.
> > > > > > Do we want it for Ignite?
> > > > > >
> > > > > > From SO discussion I see following error message: ": >>> Possible
> > > > > > starvation in striped pool."
> > > > > > Are you sure this message are clear for Ignite user(not Ignite
> > > hacker)?
> > > > > > What user should do to prevent this error in future?
> > > > > >
> > > > > > В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov пишет:
> > > > > > > By default, SYSTEM_WORKER_BLOCKED failure type is not handled.
> I
> > > > don't
> > > > > > like
> > > > > > > this behavior, but it may be useful sometimes: "frozen" threads
> > > have
> > > > a
> > > > > > > chance to become active again after load decreases. As for
> > > > > > > SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use
> to
> > > > wait
> > > > > > for
> > > > > > > dead thread's magical resurrection. Then, if under some
> > > circumstances
> > > > > > node
> > > > > > > stop leads to cascade cluster crash, then it's a bug, and it
> should
> > > > be
> > > > > > > fixed. Once and for all. Instead of hiding the flaw we have in
> the
> > > > > > product.
> > > > > > >
> > > > > > > вт, 26 мар. 2019 г. в 09:17, Roman Shtykh
> > > <[hidden email]
> > > > >:
> > > > > > >
> > > > > > > > + 1 for having the default settings revisited.
> > > > > > > > I understand Andrey's reasonings, but sometimes taking nodes
> down
> > > > is
> > > > > > too
> > > > > > > > radical (as in my case it was
> GridDhtInvalidPartitionException
> > > > which
> > > > > > could
> > > > > > > > be ignored for a while when rebalancing <- I might be wrong
> > > here).
> > > > > > > >
> > > > > > > > -- Roman
> > > > > > > >
> > > > > > > >
> > > > > > > >    On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis
> Magda <
> > > > > > > > [hidden email]> wrote:
> > > > > > > >
> > > > > > > > p    Nikolay,
> > > > > > > > Thanks for kicking off this discussion. Surprisingly,
> planned to
> > > > > start
> > > > > > a
> > > > > > > > similar one today and incidentally came across this thread.
> > > > > > > > Agree that the failure handler should be off by default or
> the
> > > > > default
> > > > > > > > settings have to be revisited. That's true that people are
> > > > > complaining
> > > > > > of
> > > > > > > > nodes shutdowns even on moderate workloads. For instance,
> that's
> > > > the
> > > > > > most
> > > > > > > > recent feedback related to slow checkpointing:
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
> > > > > > > >
> > > > > > > > At a minimum, let's consider the following:
> > > > > > > >    - A failure handler needs to provide hints on how to come
> > > around
> > > > > the
> > > > > > > > shutdown in the future. Take the checkpointing SO thread
> above.
> > > > It's
> > > > > > > > unclear from the logs how to prevent the same situation next
> time
> > > > > > (suggest
> > > > > > > > parameters for tuning, flash drives, etc).
> > > > > > > >    - Is there any protection for a full cluster restart? We
> need
> > > to
> > > > > > > > distinguish a slow cluster from the stuck one. A node removal
> > > > should
> > > > > > not
> > > > > > > > lead to a meltdown of the whole storage.
> > > > > > > >    - Should we enable the failure handler for things like
> > > > > transactions
> > > > > > or
> > > > > > > > PME and have it off for checkpointing and something else?
> Let's
> > > > have
> > > > > it
> > > > > > > > enabled for cases when we are 100% certain that a node
> shutdown
> > > is
> > > > > the
> > > > > > > > right thing and print out warnings with suggestions whenever
> > > we're
> > > > > not
> > > > > > > > confident that the removal is appropriate.
> > > > > > > > --Denis
> > > > > > > >
> > > > > > > > On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura <
> [hidden email]>
> > > > > wrote:
> > > > > > > >
> > > > > > > > Failure handlers were introduced in order to avoid cluster
> > > hanging
> > > > > and
> > > > > > > > they kill nodes instead.
> > > > > > > >
> > > > > > > > If critical worker was terminated by
> > > > GridDhtInvalidPartitionException
> > > > > > > > then your node is unable to work anymore.
> > > > > > > >
> > > > > > > > Unexpected cluster shutdown with reasons in logs that failure
> > > > > handlers
> > > > > > > > provide is better than hanging. So answer is NO. We mustn't
> > > disable
> > > > > > > > failure handlers.
> > > > > > > >
> > > > > > > > On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh
> > > > > <[hidden email]
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > If it sticks to the behavior we had before introducing
> failure
> > > > > > handler,
> > > > > > > >
> > > > > > > > I think it's better to have disabled instead of killing the
> whole
> > > > > > cluster,
> > > > > > > > as in my case, and create a parent issue for those ten
> > > bugs.Pavel,
> > > > > > thanks
> > > > > > > > for the suggestion!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay
> > > > Izhikov
> > > > > <
> > > > > > > >
> > > > > > > > [hidden email]> wrote:
> > > > > > > > >
> > > > > > > > >  Guys.
> > > > > > > > >
> > > > > > > > > We should fix the SYSTEM_WORKER_TERMINATION once and for
> all.
> > > > > > > > > Seems, we have ten or more "cluster shutdown" bugs with
> this
> > > > > > subsystem
> > > > > > > > > since it was introduced.
> > > > > > > > >
> > > > > > > > > Should we disable it by default in 2.7.5?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > пн, 25 мар. 2019 г. в 13:04, Pavel Kovalenko <
> > > [hidden email]
> > > > >:
> > > > > > > > >
> > > > > > > > > > Hi Roman,
> > > > > > > > > >
> > > > > > > > > > I think this InvalidPartition case can be simply handled
> > > > > > > > > > in GridCacheTtlManager.expire method.
> > > > > > > > > > For workaround a custom FailureHandler can be configured
> that
> > > > > will
> > > > > > not
> > > > > > > >
> > > > > > > > stop
> > > > > > > > > > a node in case of such exception is thrown.
> > > > > > > > > >
> > > > > > > > > > пн, 25 мар. 2019 г. в 08:38, Roman Shtykh
> > > > > > <[hidden email]>:
> > > > > > > > > >
> > > > > > > > > > > Igniters,
> > > > > > > > > > >
> > > > > > > > > > > Restarting a node when injecting data and having it
> > > expired,
> > > > > > results
> > > > > > > >
> > > > > > > > at
> > > > > > > > > > > GridDhtInvalidPartitionException which terminates nodes
> > > with
> > > > > > > > > > > SYSTEM_WORKER_TERMINATION one by one taking the whole
> > > cluster
> > > > > > down.
> > > > > > > >
> > > > > > > > This
> > > > > > > > > > is
> > > > > > > > > > > really bad and I didn't find the way to save the
> cluster
> > > from
> > > > > > > > > >
> > > > > > > > > > disappearing.
> > > > > > > > > > > I created a JIRA issue
> > > > > > > > > >
> > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-11620
> > > > > > > > > > > with a test case. Any clues how to fix this
> inconsistency
> > > > when
> > > > > > > > > >
> > > > > > > > > > rebalancing?
> > > > > > > > > > >
> > > > > > > > > > > -- Roman
> > > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >  Andrey Kuznetsov.
> > > > >
> > > >
> > >
> > --
> > Best Regards, Vyacheslav D.
>
Reply | Threaded
Open this post in threaded view
|

Re: GridDhtInvalidPartitionException takes the cluster down

agura
What do you think about including patches [1] and [2] in Ignite 2.7.5?
Both change the default failure handler behavior for
SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT.

Andrey Kuznetsov, could you please check whether IGNITE-10003 depends on
any other issue that isn't included in the 2.7 release?

[1] https://issues.apache.org/jira/browse/IGNITE-10154
[2] https://issues.apache.org/jira/browse/IGNITE-10003
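
To make the discussed behavior concrete, here is a minimal configuration
sketch. It assumes the AbstractFailureHandler.setIgnoredFailureTypes(...)
setter that accompanies these changes; whether the two types below end up
ignored by default is exactly what the patches decide, so treat this as an
explicit opt-in illustration rather than the patched defaults.

import java.util.EnumSet;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlerDefaults {
    public static void main(String[] args) {
        // Keep stop-or-halt behavior for genuinely fatal failures, but do
        // not treat blocked workers or critical-operation timeouts as fatal.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();

        hnd.setIgnoredFailureTypes(EnumSet.of(
            FailureType.SYSTEM_WORKER_BLOCKED,
            FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setFailureHandler(hnd);

        Ignition.start(cfg);
    }
}

A node configured this way will not be stopped on those two failure types,
while all other critical failures still invalidate it.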

On Wed, Mar 27, 2019 at 8:11 AM Denis Magda <[hidden email]> wrote:

>
> Folks, thanks for sharing details and inputs. This is helpful. Since I
> spend a lot of time working with Ignite users, I'll look into this topic in
> a couple of days and propose some changes. In the meantime, here is a fresh
> report on the user list:
> http://apache-ignite-users.70518.x6.nabble.com/Triggering-Rebalancing-Programmatically-get-error-while-requesting-td27651.html
>
>
> -
> Denis
Reply | Threaded
Open this post in threaded view
|

Re: GridDhtInvalidPartitionException takes the cluster down

Andrey Kuznetsov
I see no other dependencies for IGNITE-10003.

Best regards,
Andrey Kuznetsov.

Wed, Mar 27, 2019, 18:25 Andrey Gura [hidden email]:

> What do you think about including patches [1] and [2] in Ignite 2.7.5?
> Both change the default failure handler behavior for
> SYSTEM_WORKER_BLOCKED and SYSTEM_CRITICAL_OPERATION_TIMEOUT.
>
> Andrey Kuznetsov, could you please check whether IGNITE-10003 depends on
> any other issue that isn't included in the 2.7 release?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-10154
> [2] https://issues.apache.org/jira/browse/IGNITE-10003