Igniters,
It's a pleasure to see how our project is evolving in the direction of becoming a self-healing solution:

- Ignite can already handle critical failures such as OOM, file I/O issues, etc. [1]
- There is an endeavor to fix cluster hangs caused by partition map exchange issues. [2]

There is one more notorious problem that might affect Ignite deployments: long stop-the-world (STW) GC pauses.

I know we have made some progress in this direction [3] by providing metrics that help monitor the pauses. Why don't we keep up the pace and teach Ignite to help itself when it sees a node that brings down overall cluster performance due to a long STW pause?

I would create policies similar to the critical failure policies [4], or simply add long STW pauses to the list of critical failures and reuse the existing functionality.

Thoughts? Anyone who'd like to implement the feature?

[1] https://apacheignite.readme.io/docs/critical-failures-handling
[2] http://apache-ignite-developers.2346864.n4.nabble.com/IEP-25-Partition-Map-Exchange-hangs-resolving-td31819.html
[3] https://issues.apache.org/jira/browse/IGNITE-6171
[4] https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling
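For context, this is roughly how the existing critical-failure handling [1][4] is wired up today -- a minimal sketch using the public FailureHandler API; the handler choice and the timeout value are just illustrative:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class CriticalFailureHandlingExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Existing behavior [4]: when a critical failure (OOM, blocked system
        // worker, etc.) is detected, try to stop the node gracefully and halt
        // the JVM if the stop does not finish within the timeout.
        // The 60-second timeout below is only an example value.
        cfg.setFailureHandler(new StopNodeOrHaltFailureHandler(true, 60_000L));

        Ignite ignite = Ignition.start(cfg);
    }
}

Adding long STW pauses to the set of conditions that trigger the configured handler would let us reuse all of this machinery.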
Igniters,
Pulling this discussion up. Any thoughts?

--
Denis
Denis,
I think the JVM can't easily help itself if it's in a stop-the-world (STW) pause. Most solutions I've seen for handling such situations either check heartbeats from other nodes or run a parallel supervisor process that can detect that the JVM running Ignite is stuck in an STW pause.
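A minimal sketch of the supervisor-process idea: a separate OS process polls a heartbeat file that the Ignite JVM is expected to touch periodically, and reacts when the file goes stale. The class name, file path, interval, and threshold are hypothetical, not Ignite settings:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;

// Hypothetical external watchdog: if the heartbeat file has not been touched
// for longer than maxSilence, the monitored JVM is most likely stuck in an
// STW pause (or dead), and the watchdog can alert an operator or trigger the
// node's removal from the cluster.
public class NodeWatchdog {
    public static void main(String[] args) throws Exception {
        Path heartbeat = Paths.get(args.length > 0 ? args[0] : "/tmp/ignite-heartbeat");
        Duration maxSilence = Duration.ofSeconds(30);

        while (true) {
            if (Files.exists(heartbeat)) {
                Instant lastBeat = Files.getLastModifiedTime(heartbeat).toInstant();
                if (Duration.between(lastBeat, Instant.now()).compareTo(maxSilence) > 0)
                    System.err.println("Monitored JVM has made no progress since " + lastBeat);
            }
            Thread.sleep(5_000);
        }
    }
}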
Pavel,
We can already monitor the state of individual nodes and expose it through metrics. Now I'd like to see how we can go further and automate the decision on whether a node should be kicked out of the cluster.

--
Denis
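A hedged sketch of the "automate the decision" step: once some detector (necessarily outside the frozen JVM) concludes that a node is degrading the cluster, the public cluster API can be asked to stop it. The class and method names below are hypothetical; note also that stopping a node this way needs the target to be responsive, so a node frozen in a long STW pause may still require discovery-level eviction:

import java.util.Collections;
import java.util.UUID;

import org.apache.ignite.Ignite;

public class KickDegradedNode {
    // Assumes an external detector/policy has already decided that the node
    // identified by degradedNodeId should leave the cluster.
    public static void kick(Ignite ignite, UUID degradedNodeId) {
        ignite.cluster().stopNodes(Collections.singleton(degradedNodeId));
    }
}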
Denis,
we have LongJVMPauseDetector, but it is a Java thread that will be at a safepoint during a stop-the-world pause and therefore will not make any progress. So only an external process can detect an STW pause.
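For illustration, a rough sketch of the idea behind such a detector thread (the real LongJVMPauseDetector is internal to Ignite and differs in details); it shows why the thread can only report a pause after the fact, never while the pause is in progress:

// Sleep for a fixed interval and compare the measured elapsed time with the
// expected one; the excess is time the whole JVM was stalled. The interval and
// threshold values are illustrative.
public class PauseDetectorSketch implements Runnable {
    private static final long INTERVAL_MS = 50;    // polling interval
    private static final long THRESHOLD_MS = 500;  // pause worth reporting

    @Override public void run() {
        long prev = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(INTERVAL_MS);
            }
            catch (InterruptedException e) {
                return;
            }
            long now = System.nanoTime();
            long pauseMs = (now - prev) / 1_000_000L - INTERVAL_MS;
            if (pauseMs > THRESHOLD_MS)
                System.err.println("Detected JVM pause of ~" + pauseMs + " ms");
            prev = now;
        }
    }
}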
I see, then we need to come up with an external process-based solution for
the sake of the new ticket.

--
Denis