[jira] [Created] (IGNITE-8967) Automatic Handling of Long Stop-the-World Pauses


Anton Vinogradov (Jira)
Denis Magda created IGNITE-8967:
-----------------------------------

             Summary: Automatic Handling of Long Stop-the-World Pauses
                 Key: IGNITE-8967
                 URL: https://issues.apache.org/jira/browse/IGNITE-8967
             Project: Ignite
          Issue Type: New Feature
            Reporter: Denis Magda


Based on the discussion on the dev list:
http://apache-ignite-developers.2346864.n4.nabble.com/Automatic-Handling-of-Long-Stop-the-World-Pauses-td31847.html

Ignite comes with a number of self-healing capabilities:
* Ignite can already handle critical failures such as OOM, File I/O issues, etc. [1]
* There is an endeavor to fix cluster lock-ins due to partition map exchange issues. [2]

There is one more notorious problem that can affect Ignite deployments: long stop-the-world GC pauses. We have made some progress in this direction [3] by providing metrics that help monitor the pauses.

For now, I would either create specific policies similar to the critical failure policies [4] or simply add long stop-the-world pauses to the list of critical failures [1].
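
To make the idea concrete, here is a minimal sketch of how a long pause could be detected and handed to a policy. It is an illustration only, not Ignite's API: the class LongPauseWatchdog, its methods, and the sleepMillis/thresholdMillis parameters are all hypothetical. A daemon thread sleeps for a short interval and treats any overrun of that sleep as time the JVM was frozen. Note that this catches any stall (GC, safepoint, OS scheduling), not only GC pauses.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical sketch: detects long JVM pauses by measuring how far a short sleep overruns. */
final class LongPauseWatchdog {
    /** Estimated pause: elapsed time minus the requested sleep, never negative. */
    static long estimatePauseMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    /**
     * Starts a daemon thread that invokes onLongPause (the assumed "policy" hook)
     * whenever the estimated pause exceeds thresholdMillis.
     */
    static Thread start(long sleepMillis, long thresholdMillis, LongConsumer onLongPause) {
        Thread t = new Thread(() -> {
            long last = System.nanoTime();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    return; // stop quietly when interrupted
                }
                long now = System.nanoTime();
                long pause = estimatePauseMillis(last, now, sleepMillis);
                if (pause > thresholdMillis)
                    onLongPause.accept(pause);
                last = now;
            }
        }, "long-pause-watchdog");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

A caller might use LongPauseWatchdog.start(50, 500, ms -> System.err.println("Long pause: " + ms + " ms")); the callback is where a policy such as those in [4] would decide whether to log, stop the node, or take another action.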

[1] https://apacheignite.readme.io/docs/critical-failures-handling
[2] http://apache-ignite-developers.2346864.n4.nabble.com/IEP-25-Partition-Map-Exchange-hangs-resolving-td31819.html
[3] https://issues.apache.org/jira/browse/IGNITE-6171
[4] https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)