Igniters,
It's a pleasure to see how our project is evolving in the direction of becoming a self-healing solution:

- Ignite can already handle critical failures such as OOM, file I/O issues, etc. [1]
- There is an endeavor to fix cluster hangs caused by partition map exchange issues. [2]

There is one more notorious problem that might affect Ignite deployments: long stop-the-world (STW) GC pauses.

I know we have made some progress in this direction [3] by providing metrics that help monitor the pauses. Why don't we keep up the pace and teach Ignite to help itself when it sees a node that brings down overall cluster performance due to a long STW pause?

I would create policies similar to the critical failure policies [4], or simply add long STW pauses to the list of critical failures and reuse the existing functionality.

Thoughts? Anyone who'd like to implement the feature?

[1] https://apacheignite.readme.io/docs/critical-failures-handling
[2] http://apache-ignite-developers.2346864.n4.nabble.com/IEP-25-Partition-Map-Exchange-hangs-resolving-td31819.html
[3] https://issues.apache.org/jira/browse/IGNITE-6171
[4] https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling
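For context, this is roughly how the existing critical-failure handling [1][4] is wired up today -- a minimal sketch using the public FailureHandler API; the handler choice and the timeout value are just illustrative:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class CriticalFailureHandlingExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Existing behavior [4]: when a critical failure (OOM, blocked system
        // worker, etc.) is detected, try to stop the node gracefully and halt
        // the JVM if the stop does not finish within the timeout.
        // The 60-second timeout below is only an example value.
        cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(true, 60_000L));

        Ignite ignite = Ignition.start(cfg);
    }
}

Adding long STW pauses to the set of conditions that trigger the configured handler would let us reuse all of this machinery.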
Igniters,
Pulling this discussion up. Any thoughts?

--
Denis
Denis,
I think the JVM can't easily help itself if it's in a stop-the-world (STW) pause. Most solutions I've seen for handling such situations either check heartbeats from other nodes or run a parallel supervisor process that can detect that the JVM running Ignite is stuck in an STW pause.
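A minimal sketch of the supervisor-process idea: a separate OS process polls a heartbeat file that the Ignite JVM is expected to touch periodically, and reacts when the file goes stale. The class name, file path, interval, and threshold are hypothetical, not Ignite settings:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

// Hypothetical external watchdog: if the heartbeat file has not been touched
// for longer than maxSilence, the monitored JVM is most likely stuck in an
// STW pause (or dead), and the watchdog can alert an operator or trigger the
// node's removal from the cluster.
public class NodeWatchdog {
    public static void main(String[] args) throws Exception {
        Path heartbeat = Paths.get(args.length > 0 ? args[0] : "/tmp/ignite-heartbeat");
        Duration maxSilence = Duration.ofSeconds(30);

        while (true) {
            if (Files.exists(heartbeat)) {
                Instant lastBeat = Files.getLastModifiedTime(heartbeat).toInstant();
                if (Duration.between(lastBeat, Instant.now()).compareTo(maxSilence) > 0)
                    System.err.println("Monitored JVM has made no progress since " + lastBeat);
            }
            Thread.sleep(5_000);
        }
    }
}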
Pavel,
We can already monitor the state of individual nodes and expose it through metrics. Now I'd like to see how we can go further and automate the decision on whether a node should be kicked out of the cluster.

--
Denis
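A hedged sketch of the "automate the decision" step: once some detector (necessarily outside the frozen JVM) concludes that a node is degrading the cluster, the public cluster API can be asked to stop it. The class and method names below are hypothetical; note also that stopping a node this way needs the target to be responsive, so a node frozen in a long STW pause may still require discovery-level eviction:

import java.util.Collections;
import java.util.UUID;

import org.apache.ignite.Ignite;

public class KickDegradedNode {
    // Assumes an external detector/policy has already decided that the node
    // identified by degradedNodeId should leave the cluster.
    public static void kick(Ignite ignite, UUID degradedNodeId) {
        ignite.cluster().stopNodes(Collections.singleton(degradedNodeId));
    }
}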
Denis,
we have LongJVMPauseDetector, but it is a Java thread that will be at a safepoint during a stop-the-world pause and therefore will not make any progress. So only an external process can detect an STW pause.
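For illustration, a rough sketch of the idea behind such a detector thread (the real LongJVMPauseDetector is internal to Ignite and differs in details); it shows why the thread can only report a pause after the fact, never while the pause is in progress:

// Sleep for a fixed interval and compare the measured elapsed time with the
// expected one; the excess is time the whole JVM was stalled. The interval and
// threshold values are illustrative.
public class PauseDetectorSketch implements Runnable {
    private static final long INTERVAL_MS = 50;    // polling interval
    private static final long THRESHOLD_MS = 500;  // pause worth reporting

    @Override public void run() {
        long prev = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(INTERVAL_MS);
            }
            catch (InterruptedException e) {
                return;
            }
            long now = System.nanoTime();
            long pauseMs = (now - prev) / 1_000_000L - INTERVAL_MS;
            if (pauseMs > THRESHOLD_MS)
                System.err.println("Detected JVM pause of ~" + pauseMs + " ms");
            prev = now;
        }
    }
}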
I see, then we need to come up with an external process-based solution for
the sake of the new ticket.

--
Denis