Igniters,
Internal problems may, unfortunately, cause unexpected cluster behavior. We should define the behavior for each kind of internal problem.

The well-known internal problems can be split into:

1) OOM or any other reason causing a node crash.

2) Situations requiring a graceful node shutdown with a custom notification:
- IgniteOutOfMemoryException
- Persistence errors
- ExchangeWorker exiting with an error

3) Performance issues that should be covered by metrics:
- GC stop-the-world (STW) duration
- Timed-out tasks and jobs
- TX deadlocks
- Hanging transactions (waiting for some service)
- Java deadlocks

I created a dedicated issue [1] to make sure all these metrics will be presented in WebConsole or VisorConsole (which is preferred?).

4) Situations requiring an external monitoring implementation:
- GC STW duration exceeding the maximum allowed length (the node should be stopped before the STW pause finishes).

All these problems were reported by different people at different times, so we should reanalyze each of them and, possibly, find better ways to solve them than those described in the issues.

P.S. IEP-7 [2] already contains 9 issues, feel free to mention anything else :)

[1] https://issues.apache.org/jira/browse/IGNITE-6961
[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
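For illustration, one common way to obtain the GC STW duration metric mentioned in item 3 is a watchdog thread that sleeps for a short fixed interval and treats any extra elapsed time as a suspected pause. Below is a minimal sketch in plain Java; it is not Ignite code, and the class name, sampling interval and threshold are illustrative assumptions.

// Measures suspected stop-the-world pauses by timer drift. A pause longer
// than WARN_THRESHOLD_MS is recorded and could back a JMX metric or a log warning.
public class PauseDetector extends Thread {
    private static final long SAMPLE_MS = 50;            // sampling interval
    private static final long WARN_THRESHOLD_MS = 500;   // report pauses above this

    private volatile long maxPauseMs;                    // longest pause seen so far

    public PauseDetector() {
        setName("stw-pause-detector");
        setDaemon(true);
    }

    @Override public void run() {
        long prev = System.nanoTime();

        while (!isInterrupted()) {
            try {
                Thread.sleep(SAMPLE_MS);
            }
            catch (InterruptedException e) {
                return;
            }

            long now = System.nanoTime();

            // Anything beyond the requested sleep is a suspected pause (GC or otherwise).
            long pauseMs = (now - prev) / 1_000_000 - SAMPLE_MS;

            if (pauseMs > WARN_THRESHOLD_MS) {
                maxPauseMs = Math.max(maxPauseMs, pauseMs);

                System.err.println("Detected possible STW pause: " + pauseMs + " ms");
            }

            prev = now;
        }
    }

    public long maxPauseMillis() {
        return maxPauseMs;
    }
}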
Hi Anton,
> - GC STW duration exceed maximum possible length (node should be stopped before STW finished)

Are you sure we should kill a node in case of a long STW pause? Could we instead produce warnings in the logs and monitoring tools, and wait a bit longer for the node to become alive when we detect an STW pause? In that case we could notify the coordinator or another node that "the current node is in STW, please wait longer than 3 heartbeat timeouts".

It is probable that such pauses will not occur often.

Sincerely,
Dmitriy Pavlov
Dmitry,
There are two cases:

1) The STW duration is long -> notify monitoring via a JMX metric.

2) The STW duration exceeds N seconds -> there is no need to wait for anything. We already know that the node will be segmented, or that a pause longer than N seconds will affect cluster performance. The better option is to kill the node as soon as possible to protect the cluster. Some customers have huge timeouts, and such a node can take down the whole cluster if it is not killed by a watchdog.
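To make the second case concrete: a thread inside the paused JVM cannot kill its own process while a stop-the-world pause is in progress, so "kill the node before the STW finishes" needs an external watchdog. Below is a minimal sketch assuming the node periodically touches a heartbeat file; the paths, PID handling and the 30-second limit are illustrative, not an existing Ignite facility.

import java.nio.file.*;

// External watchdog process: kills the node if its heartbeat file goes stale,
// i.e. the node has presumably been stuck in a long GC pause.
public class NodeWatchdog {
    public static void main(String[] args) throws Exception {
        Path heartbeat = Paths.get("/var/run/ignite/heartbeat"); // touched by the node every second
        Path pidFile = Paths.get("/var/run/ignite/node.pid");    // written by the node on startup
        long maxPauseMs = 30_000;                                // maximum tolerated pause

        while (true) {
            Thread.sleep(1_000);

            long ageMs = System.currentTimeMillis()
                - Files.getLastModifiedTime(heartbeat).toMillis();

            if (ageMs > maxPauseMs) {
                String pid = new String(Files.readAllBytes(pidFile)).trim();

                // Kill the node before the pause ends and it starts acting on a stale cluster view.
                new ProcessBuilder("kill", "-9", pid).inheritIO().start();

                return;
            }
        }
    }
}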
If an Ignite operation hangs for some reason due to an internal problem or buggy application code, it needs to eventually *time out*.
Take the atomic operations case Val recently brought to our attention:
http://apache-ignite-developers.2346864.n4.nabble.com/Timeouts-in-atomic-cache-td19839.html

An application must not freeze waiting for human intervention if an atomic update fails internally.

Moreover, I would let all possible operations time out:
- Ignite compute computations.
- Ignite service calls.
- Atomic/transactional cache updates.
- SQL queries.

I'm not sure this is covered by any of the tickets from IEP-7. Any thoughts/suggestions before one is created?

—
Denis
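For context, several of these operations already expose timeout knobs in the public API, which is what makes the missing timeout for atomic updates stand out. The sketch below reflects my understanding of the 2.x API, so treat the method names as assumptions, and it presumes the ignite-indexing module is on the classpath for the SQL part.

import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class TimeoutExamples {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache("test");

            // Compute: per-task timeout in milliseconds.
            ignite.compute().withTimeout(5_000).run(() -> System.out.println("job"));

            // Explicit transaction: timeout passed to txStart.
            try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ,
                3_000,   // timeout, ms
                0)) {
                cache.put(1, "value");

                tx.commit();
            }

            // SQL query: per-query timeout.
            cache.query(new SqlFieldsQuery("select 1").setTimeout(2, TimeUnit.SECONDS)).getAll();

            // Atomic cache updates (put on an ATOMIC cache) had no such
            // per-operation timeout, which is the gap discussed above.
        }
    }
}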
The lack of suggestions and thoughts encouraged me to create a ticket:
https://issues.apache.org/jira/browse/IGNITE-6980

—
Denis
Hi Anton,
I don't think we should shut down a node in case of an IgniteOutOfMemoryException: if one node has no space, the others probably don't have it either, so rebalancing will cause IgniteOOM on all the other nodes and kill the whole cluster. I think for some configurations the cluster should survive and allow the user to clean a cache and/or add more nodes.

Thanks,
Mikhail.
In the first iteration I would focus only on reporting facilities, to let an administrator spot a dangerous situation. In the second phase, when all reporting and metrics are ready, we can think about automatic actions.
Just provide FailureProcessingPolicy with possible reactions:
- NOOP - exceptions will be reported and metrics will be triggered, but the affected Ignite process won't be touched.
- HALT (or STOP or KILL) - all the actions of NOOP + Ignite process termination.
- RESTART - NOOP actions + process restart.
- EXEC - execute a custom script provided by the user.

If needed, the policy can be set per known failure (such as OOM or persistence errors) so that the user can act accordingly depending on the context.

—
Denis
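To illustrate what such a per-failure mapping could look like, here is a purely hypothetical sketch of the proposal; none of these types existed in Ignite at the time, and all names are made up for illustration.

import java.util.EnumMap;
import java.util.Map;

public class FailurePolicySketch {
    /** Kinds of failures a node could detect (hypothetical). */
    enum FailureType { IGNITE_OOM, PERSISTENCE_ERROR, EXCHANGE_WORKER_DEATH }

    /** Reactions from the proposal above. */
    enum FailureProcessingPolicy { NOOP, HALT, RESTART, EXEC }

    /** Per-failure mapping with a default for anything not listed. */
    private static final Map<FailureType, FailureProcessingPolicy> POLICIES =
        new EnumMap<>(FailureType.class);

    static {
        POLICIES.put(FailureType.IGNITE_OOM, FailureProcessingPolicy.NOOP);
        POLICIES.put(FailureType.PERSISTENCE_ERROR, FailureProcessingPolicy.HALT);
        POLICIES.put(FailureType.EXCHANGE_WORKER_DEATH, FailureProcessingPolicy.RESTART);
    }

    static FailureProcessingPolicy policyFor(FailureType type) {
        return POLICIES.getOrDefault(type, FailureProcessingPolicy.HALT);
    }
}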
Denis,
I propose starting with the first three policies (they are already implemented and just await some code combing, commit & review). As for the fourth policy (EXEC), I think it is rather an additional property (some script path) than a policy.
No objections here. Additional policies like EXEC might be added later depending on user needs.
—
Denis
Dmitry,
How will these policies be configured? Do you have any API in mind?
Vladimir,
These policies (a single policy, in fact) can be configured on IgniteConfiguration by calling the setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) method.

--
Дмитрий Сорокин
I think the failure processing policy should be configured via IgniteConfiguration in a way similar to the segmentation policies.
—
Denis
Denis,
Yes, but can we look at the proposed API before we dig into the implementation?
Vladimir,
These policies (a single policy, in fact) can be configured on IgniteConfiguration by calling the setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) method.
Dmitry,
Thank you, but what does FailureProcessingPolicy look like? It is not clear how I can configure different reactions for different event types.
Vladimir,
At the moment the policy looks like this:

/**
 * Policy that defines how a node will process failures. Note that the default
 * failure processing policy is defined by the {@link IgniteConfiguration#DFLT_FLR_PLC} property.
 */
public enum FailureProcessingPolicy {
    /** Restart JVM. */
    RESTART_JVM,

    /** Stop. */
    STOP,

    /** No-op. */
    NOOP;
}

Could you give an example of different event (failure) types that need different reactions? We expect that all failures in which some Ignite system worker (or other critical component) breaks need the same policy for a given node.
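To tie this together, the configuration call mentioned earlier in the thread would presumably be used along these lines. The setFailureProcessingPolicy method and the enum are the ones proposed above, not a shipped Ignite API, so treat the snippet as illustrative only.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class FailurePolicyConfigExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Single, node-wide reaction to critical failures (proposed API).
        cfg.setFailureProcessingPolicy(FailureProcessingPolicy.RESTART_JVM);

        Ignite ignite = Ignition.start(cfg);
    }
}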