Apache Ignite Developers - Legacy Mail Archive

Critical worker threads liveness checking drawbacks

Andrey Kuznetsov

Re: Critical worker threads liveness checking drawbacks

Maxim,

Thanks for being attentive! It's definitely a typo. Could you please create
an issue?

On Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[hidden email]> wrote:

> Folks,
>
> In `GridCachePartitionExchangeManager:2684` [1] (master branch) I've found
> that the wait on the exchange future is wrapped in two `blockingSectionEnd`
> calls. Is that correct? I'd like to understand this change and how I should
> use it in the future.
>
> Should I file a new issue to fix this? I think the first call should be
> `blockingSectionBegin`.
>
> -------------
> blockingSectionEnd();
>
> try {
>     resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> } finally {
>     blockingSectionEnd();
> }
>
> [1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
>
> On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]> wrote:
>
> > Andrey Gura, thank you for the answer!
> >
> > I agree that wrapping the 'init' method reduces the benefit of the watchdog
> > service for the PME worker, but in other cases we should wrap every
> > potentially long section of GridDhtPartitionsExchangeFuture, for example
> > the 'onCacheChangeRequest' method or the
> > 'cctx.affinity().onCacheChangeRequest' call inside it, because it may take
> > significant time (reproducer attached).
> >
> > I only want to point out a possible issue that may allow an end user to
> > halt the Ignite cluster accidentally.
> >
> > I'm sure that PME experts know how to fix this issue properly.
> > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]> wrote:
> > >
> > > Vyacheslav,
> > >
> > > > The exchange worker is tightly coupled to
> > > > GridDhtPartitionsExchangeFuture#init, and that is OK. The exchange worker
> > > > also shouldn't be blocked for a long time, but in reality it happens. It
> > > > also means that your change doesn't make sense.
> > > >
> > > > What actually makes sense is identifying the places that block
> > > > intentionally. Maybe some places/actions should be wrapped in blocking
> > > > guards.
> > > >
> > > > If you have failing tests, please make sure that your failureHandler is
> > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > [CRITICAL_WORKER_BLOCKED].
> > >
> > >
> > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <[hidden email]> wrote:
> > > >
> > > > Hi Igniters!
> > > >
> > > > Thank you for this important improvement!
> > > >
> > > > > I've looked through the implementation and noticed that
> > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a blocking
> > > > > section. This means it is easy to halt the node with long-running
> > > > > actions during PME, for example when we create a cache with a
> > > > > StoreFactory that connects to a 3rd-party DB.
> > > > >
> > > > > I'm not sure that this is the right behavior.
> > > > >
> > > > > I filed the issue [1] and prepared the PR [2] with a reproducer and a
> > > > > possible fix.
> > > > >
> > > > > Andrey, could you please take a look and confirm that it makes sense?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <[hidden email]> wrote:
> > > > > >
> > > > > > Denis,
> > > > > >
> > > > > > I've created the ticket [1] with a short description of the functionality.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > >
> > > > >
> > > > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]> wrote:
> > > > > >
> > > > > > > Andrey K. and G.,
> > > > > > >
> > > > > > > Thanks, do we have a documentation ticket created? Prachi (copied) can help
> > > > > > > with the documentation.
> > > > > >
> > > > > > --
> > > > > > Denis
> > > > > >
> > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <[hidden email]> wrote:
> > > > > > >
> > > > > > > > Andrey,
> > > > > > > >
> > > > > > > > Your change has finally been merged to the master branch. Congratulations
> > > > > > > > and thank you very much! :)
> > > > > > > >
> > > > > > > > I think the next step is a feature that will signal blocked threads to
> > > > > > > > monitoring tools via an MXBean.
> > > > > > > >
> > > > > > > > I hope you will continue developing this feature and share your vision
> > > > > > > > in a new JIRA issue.
> > > > > > >
> > > > > > >
> > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > David, Maxim!
> > > > > > > > >
> > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt all of them
> > > > > > > > > right now: their scope is much broader than the scope of the change I am
> > > > > > > > > implementing. I have talked to a group of Ignite committers, and we
> > > > > > > > > agreed to complete the change as follows.
> > > > > > > > > - Blocking instructions in system-critical threads that may reasonably
> > > > > > > > >   take a long time should be explicitly excluded from the monitoring.
> > > > > > > > > - Failure handlers should have a setting to suppress some failures on a
> > > > > > > > >   per-failure-type basis.
> > > > > > > > > I have updated the implementation accordingly: [1]
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > >
> > > > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > > When I've done this before, I've needed to find the oldest thread and
> > > > > > > > > > kill the node running it. From a language standpoint, Maxim's "without
> > > > > > > > > > progress" is better than "heartbeat". For example, what I'm most
> > > > > > > > > > interested in on a distributed system is which thread started its
> > > > > > > > > > still-uncompleted work the earliest, and when that thread last made
> > > > > > > > > > forward progress. You don't want to kill a node because a thread is
> > > > > > > > > > waiting on a lock held by a thread that went off-node and has not
> > > > > > > > > > gotten a response. If you don't understand the dependency
> > > > > > > > > > relationships, you will make incorrect recovery decisions.
> > > > > > > > >
> > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > >
> > > > > > > > > > > First,
> > > > > > > > > > > - Ignore uninterruptible actions (e.g. worker/service shutdown)
> > > > > > > > > > > - Long I/O operations (there should be a configurable timeout for
> > > > > > > > > > >   each type of usage)
> > > > > > > > > > > - Infinite loops
> > > > > > > > > > > - Stalled/deadlocked threads (and/or too many parked threads,
> > > > > > > > > > >   excluding I/O)
> > > > > > > > > > >
> > > > > > > > > > > Second,
> > > > > > > > > > > - The working queue is without progress (e.g. disco, exchange queues)
> > > > > > > > > > > - Work hasn't been completed since the last heartbeat (checking
> > > > > > > > > > >   milestones)
> > > > > > > > > > > - Too many system resources used by a thread for a long period of
> > > > > > > > > > >   time (allocated memory, CPU)
> > > > > > > > > > > - Timing fields associated with each thread status exceeded a maximum
> > > > > > > > > > >   time limit.
> > > > > > > > > > >
> > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > - `log everything` should be the default behaviour in all these
> > > > > > > > > > >   cases, since it may be difficult to find the cause after the
> > > > > > > > > > >   restart.
> > > > > > > > > > > - Wait some interval of time and kill the hanging node (the cluster
> > > > > > > > > > >   should be configured to be stable enough)
> > > > > > > > > > >
> > > > > > > > > > > Questions,
> > > > > > > > > > > - Not sure, but can workers miss their heartbeat deadlines if CPU
> > > > > > > > > > >   load rises to 80%-90%? Bursts of momentary overload can be expected
> > > > > > > > > > >   behaviour as a normal part of system operations.
> > > > > > > > > > > - Why did we decide that critical threads should monitor each other?
> > > > > > > > > > >   For instance, if all the tasks were blocked and unable to run, a
> > > > > > > > > > >   node reset would never occur. As for me, a better solution is to
> > > > > > > > > > >   use a separate monitor thread or pool (maybe both, with software
> > > > > > > > > > >   and hardware checks) that not only checks heartbeats but monitors
> > > > > > > > > > >   the rest of the system as well.
> > > > > > > > > >
> > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <[hidden email]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > It would be safer to restart the entire cluster than to remove the
> > > > > > > > > > > > last node for a cache that should be redundant.
> > > > > > > > > > >
> > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I agree with Yakov that we can provide an option that manages the
> > > > > > > > > > > > > worker liveness checker's behavior when some worker is observed to
> > > > > > > > > > > > > be blocked for too long.
> > > > > > > > > > > > > At least it will be a workaround for cases where node failure is
> > > > > > > > > > > > > too annoying.
> > > > > > > > > > > > >
> > > > > > > > > > > > > A backups count threshold sounds good, but I don't understand how
> > > > > > > > > > > > > it will help if the cluster hangs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The simplest solution here is to raise an alert when some critical
> > > > > > > > > > > > > worker is blocked (we can improve WorkersRegistry for this purpose
> > > > > > > > > > > > > and expose the list of blocked workers) and optionally call the
> > > > > > > > > > > > > system-configured failure processor. BTW, the failure processor
> > > > > > > > > > > > > can be extended to perform any checks (e.g. backup count) and
> > > > > > > > > > > > > decide whether it should stop the node or not.
> > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <[hidden email]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness checks deal
> > > > > > > > > > > > > > with _critical_ conditions, i.e. when such a condition is met we
> > > > > > > > > > > > > > consider the node totally broken, and there is no sense in
> > > > > > > > > > > > > > keeping it alive regardless of the data it contains. If we want
> > > > > > > > > > > > > > to give it a chance, then the condition (long fsync etc.) should
> > > > > > > > > > > > > > not be considered critical at all.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <[hidden email]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a
> > > > > > > > > > > > > > > backups count threshold (at runtime also!) that will not allow
> > > > > > > > > > > > > > > any automatic stop if there would be data loss. Andrey, what do
> > > > > > > > > > > > > > > you think?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > --
> > > > > > > > > > Maxim Muzafarov
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Andrey Kuznetsov.
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrey Kuznetsov.
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards, Vyacheslav D.
> >
> >
> >
> > --
> > Best Regards, Vyacheslav D.
> >
> --
> --
> Maxim Muzafarov
>

--
Best regards,
Andrey Kuznetsov.
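
For reference, the pairing that Maxim's report asks for can be sketched as follows: the wait is preceded by `blockingSectionBegin()` and followed by `blockingSectionEnd()` in a `finally` block. Only those two method names and the `exchFut.get(exchTimeout, TimeUnit.MILLISECONDS)` call come from the thread. The `SketchWorker` class, its `blocked` flag, the unpaired-call counter and the use of `CompletableFuture` are assumptions made for illustration and are not Ignite's actual implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a monitored worker; not Ignite's real class. */
class SketchWorker {
    /** True while the worker is inside a section the watchdog should ignore. */
    private volatile boolean blocked;

    /** Counts begin/end calls that found the flag already in the requested state. */
    private int unpairedCalls;

    void blockingSectionBegin() {
        if (blocked)
            unpairedCalls++;

        blocked = true;
    }

    void blockingSectionEnd() {
        if (!blocked)
            unpairedCalls++;

        blocked = false;
    }

    boolean isBlocked() {
        return blocked;
    }

    int unpairedCalls() {
        return unpairedCalls;
    }

    /** Waits on the future with the begin/end pair placed around the wait. */
    <T> T awaitExchange(CompletableFuture<T> exchFut, long exchTimeout) throws Exception {
        blockingSectionBegin();

        try {
            return exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
        }
        finally {
            // Runs on success, timeout and failure alike, so the flag is always cleared.
            blockingSectionEnd();
        }
    }

    public static void main(String[] args) throws Exception {
        SketchWorker w = new SketchWorker();

        String resVer = w.awaitExchange(CompletableFuture.completedFuture("v1"), 100);

        System.out.println(resVer + " blocked=" + w.isBlocked() + " unpaired=" + w.unpairedCalls());
    }
}
```

With the doubled `blockingSectionEnd()` from the quoted snippet, the counter in this sketch would record the first call as unpaired and the wait itself would run while the worker still appears unblocked to the watchdog.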

Nikolay Izhikov-2

Re: Critical worker threads liveness checking drawbacks

Hello, Igniters.

I found that this feature can't be disabled from the configuration.
The only way to disable it is through a JMX bean.

I think this is very dangerous: if there is some corner case or a bug in this watchdog, it can make Ignite unusable.
I propose adding the ability to disable this feature both from the configuration and from JVM options.

What do you think?
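
One possible shape of this proposal, as a sketch only: a timeout that can come from a JVM option or from configuration, where a non-positive value switches the check off. The class `WatchdogSettings`, the property name `SKETCH_WORKER_BLOCKED_TIMEOUT` and the rule that the JVM option takes precedence are all assumptions invented for this example; none of them is taken from the Ignite code base.

```java
/** Hypothetical settings holder; the class and the property name are illustrative only. */
final class WatchdogSettings {
    /** Assumed system property name, invented for this sketch. */
    static final String TIMEOUT_PROP = "SKETCH_WORKER_BLOCKED_TIMEOUT";

    /** Timeout set in configuration, or null when the user did not set one. */
    private final Long configuredTimeoutMs;

    WatchdogSettings(Long configuredTimeoutMs) {
        this.configuredTimeoutMs = configuredTimeoutMs;
    }

    /** Resolves the timeout: the JVM option wins, then the configuration value, then the default. */
    long effectiveTimeoutMs(long dfltMs) {
        // Long.getLong returns null if the property is absent or is not a valid number.
        Long fromJvm = Long.getLong(TIMEOUT_PROP);

        if (fromJvm != null)
            return fromJvm;

        return configuredTimeoutMs != null ? configuredTimeoutMs : dfltMs;
    }

    /** In this sketch a non-positive timeout means the liveness check is switched off. */
    boolean livenessCheckEnabled(long dfltMs) {
        return effectiveTimeoutMs(dfltMs) > 0;
    }

    public static void main(String[] args) {
        // No configuration value: the default applies and the check stays on.
        System.out.println(new WatchdogSettings(null).livenessCheckEnabled(10_000));

        // Configuration sets the timeout to zero: the check is off unless a JVM option overrides it.
        System.out.println(new WatchdogSettings(0L).livenessCheckEnabled(10_000));
    }
}
```

Letting the JVM option override the configuration is a design choice of the sketch: it would allow an operator to switch the watchdog off on a single node (for example with `-DSKETCH_WORKER_BLOCKED_TIMEOUT=0`) without editing shared configuration files.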


Alexey Goncharuk

Re: Critical worker threads liveness checking drawbacks

Nikolay, I agree, a user should be able to disable both the thread liveness
check and the checkpoint read lock timeout check from the config and from a
system property.
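
A related switch already mentioned earlier in this thread is per-failure-type suppression in the failure handler (Andrey Gura refers to a handler with ignoreFailureTypes = [CRITICAL_WORKER_BLOCKED]). A minimal sketch of that idea follows. The enum `SketchFailureType`, the class `SketchFailureHandler` and the `CRITICAL_ERROR` constant are hypothetical names made up for this example; only the `ignoreFailureTypes` and `CRITICAL_WORKER_BLOCKED` spellings are taken from the thread, and the real Ignite classes may differ.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical failure types; only CRITICAL_WORKER_BLOCKED is named in the thread. */
enum SketchFailureType {
    CRITICAL_WORKER_BLOCKED,
    CRITICAL_ERROR
}

/** Hypothetical handler that stops the node unless the failure type is ignored. */
final class SketchFailureHandler {
    /** Failure types that are logged but never stop the node. */
    private final Set<SketchFailureType> ignoreFailureTypes = EnumSet.noneOf(SketchFailureType.class);

    /** Set once a non-ignored failure has been handled. */
    private boolean nodeStopped;

    SketchFailureHandler(Set<SketchFailureType> ignoreFailureTypes) {
        this.ignoreFailureTypes.addAll(ignoreFailureTypes);
    }

    /** Returns true if the failure led to a (simulated) node stop, false if it was ignored. */
    boolean onFailure(SketchFailureType type) {
        if (ignoreFailureTypes.contains(type)) {
            System.out.println("Ignored failure: " + type);

            return false;
        }

        nodeStopped = true;

        return true;
    }

    boolean nodeStopped() {
        return nodeStopped;
    }

    public static void main(String[] args) {
        SketchFailureHandler hnd =
            new SketchFailureHandler(EnumSet.of(SketchFailureType.CRITICAL_WORKER_BLOCKED));

        hnd.onFailure(SketchFailureType.CRITICAL_WORKER_BLOCKED); // Logged only.
        hnd.onFailure(SketchFailureType.CRITICAL_ERROR);          // Marks the node as stopped.

        System.out.println("stopped=" + hnd.nodeStopped());
    }
}
```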

On Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[hidden email]> wrote:

> Hello, Igniters.
>
> I found that this feature can't be disabled from the configuration.
> The only way to disable it is through a JMX bean.
>
> I think this is very dangerous: if there is some corner case or a bug in this
> watchdog, it can make Ignite unusable.
> I propose adding the ability to disable this feature both from the
> configuration and from JVM options.
>
> What do you think?
>
> В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > Maxim,
> >
> > Thanks for being attentive! It's definitely a typo. Could you please
> create
> > an issue?
> >
> > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov <[hidden email]>:
> >
> > > Folks,
> > >
> > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master
> branch)
> > > exchange future wrapped
> > > with double `blockingSectionEnd` method. Is it correct? I just want to
> > > understand this change and
> > > how should I use this in the future.
> > >
> > > Should I file a new issue to fix this? I think here
> `blockingSectionBegin`
> > > method should be used.
> > >
> > > -------------
> > > blockingSectionEnd();
> > >
> > > try {
> > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > } finally {
> > > blockingSectionEnd();
> > > }
> > >
> > >
> > > [1]
> > >
> > >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > >
> > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]>
> > > wrote:
> > >
> > > > Andrey Gura, thank you for the answer!
> > > >
> > > > I agree that wrapping of 'init' method reduces the profit of watchdog
> > > > service in case of PME worker, but in other cases, we should wrap all
> > > > possible long sections on GridDhtPartitionExchangeFuture. For example
> > > > 'onCacheChangeRequest' method or
> > > > 'cctx.affinity().onCacheChangeRequest' inside because it may take
> > > > significant time (reproducer attached).
> > > >
> > > > I only want to point out a possible issue which may allow to end-user
> > > > halt the Ignite cluster accidentally.
> > > >
> > > > I'm sure that PME experts know how to fix this issue properly.
> > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]>
> wrote:
> > > > >
> > > > > Vyacheslav,
> > > > >
> > > > > Exchange worker is strongly tied with
> > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker
> also
> > > > > shouldn't be blocked for long time but in reality it happens.It
> also
> > > > > means that your change doesn't make sense.
> > > > >
> > > > > What actually make sense it is identification of places which
> > > > > intentionally blocking. May be some places/actions should be
> braced by
> > > > > blocking guards.
> > > > >
> > > > > If you have failing tests please make sure that your
> failureHandler is
> > > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > [CRITICAL_WORKER_BLOCKED].
> > > > >
> > > > >
> > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > >
> > > [hidden email]>
> > > > wrote:
> > > > > >
> > > > > > Hi Igniters!
> > > > > >
> > > > > > Thank you for this important improvement!
> > > > > >
> > > > > > I've looked through implementation and noticed that
> > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in
> blocked
> > > > > > section. This means it easy to halt the node in case of
> longrunning
> > > > > > actions during PME, for example when we create a cache with
> > > > > > StoreFactrory which connect to 3rd party DB.
> > > > > >
> > > > > > I'm not sure that it is the right behavior.
> > > > > >
> > > > > > I filled the issue [1] and prepared the PR [2] with reproducer
> and
> > > >
> > > > possible fix.
> > > > > >
> > > > > > Andrey, could you please look at and confirm that it makes sense?
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <
> [hidden email]>
> > > >
> > > > wrote:
> > > > > > >
> > > > > > > Denis,
> > > > > > >
> > > > > > > I've created the ticket [1] with short description of the
> > > >
> > > > functionality.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > >
> > > > > > >
> > > > > > > > Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]>:
> > > > > > >
> > > > > > > > Andrey K. and G.,
> > > > > > > >
> > > > > > > > Thanks, do we have a documentation ticket created? Prachi
> > >
> > > (copied)
> > > > can help
> > > > > > > > with the documentation.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Denis
> > > > > > > >
> > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <
> [hidden email]>
> > > >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Andrey,
> > > > > > > > >
> > > > > > > > > finally your change is merged to master branch.
> Congratulations
> > > >
> > > > and
> > > > > > > > > thank you very much! :)
> > > > > > > > >
> > > > > > > > > I think that the next step is feature that will allow
> signal
> > > >
> > > > about
> > > > > > > > > blocked threads to the monitoring tools via MXBean.
> > > > > > > > >
> > > > > > > > > I hope you will continue development of this feature and
> > >
> > > provide
> > > > your
> > > > > > > > > vision in new JIRA issue.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <
> > > >
> > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > David, Maxim!
> > > > > > > > > >
> > > > > > > > > > Thanks a lot for you ideas. Unfortunately, I can't adopt
> all
> > > >
> > > > of them
> > > > > > > > > right
> > > > > > > > > > now: the scope is much broader than the scope of the
> change I
> > > > > > > >
> > > > > > > > implement.
> > > > > > > > > I
> > > > > > > > > > have had a talk to a group of Ignite commiters, and we
> agreed
> > > >
> > > > to
> > > > > > > > complete
> > > > > > > > > > the change as follows.
> > > > > > > > > > - Blocking instructions in system-critical which may
> > >
> > > resonably
> > > > last
> > > > > > > > long
> > > > > > > > > > should be explicitly excluded from the monitoring.
> > > > > > > > > > - Failure handlers should have a setting to suppress some
> > > >
> > > > failures on
> > > > > > > > > > per-failure-type basis.
> > > > > > > > > > According to this I have updated the implementation: [1]
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > >
> > > > > > > > > > > Mon, 10 Sep 2018 at 22:35, David Harvey <
> > > >
> > > > [hidden email]>:
> > > > > > > > > >
> > > > > > > > > > > When I've done this before,I've needed to find the
> oldest
> > > >
> > > > thread,
> > > > > > > > and
> > > > > > > > > kill
> > > > > > > > > > > the node running that. From a language standpoint,
> > >
> > > Maxim's
> > > > "without
> > > > > > > > > > > progress" better than "heartbeat". For example, what
> I'm
> > > >
> > > > most
> > > > > > > > > interested
> > > > > > > > > > > in on a distributed system is which thread started the
> work
> > > >
> > > > it has
> > > > > > > > not
> > > > > > > > > > > completed the earliest, and when did that thread last
> make
> > > >
> > > > forward
> > > > > > > > > > > process. You don't want to kill a node because a
> thread
> > > >
> > > > is
> > > > > > > > waiting
> > > > > > > > > on a
> > > > > > > > > > > lock held by a thread that went off-node and has not
> > >
> > > gotten a
> > > > > > > > response.
> > > > > > > > > > > If you don't understand the dependency relationships,
> you
> > > >
> > > > will make
> > > > > > > > > > > incorrect recovery decisions.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <
> > > >
> > > > [hidden email]>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I think we should find exact answers to these
> questions:
> > > > > > > > > > > > 1. What `critical` issue exactly is?
> > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > >
> > > > > > > > > > > > First,
> > > > > > > > > > > > - Ignore uninterruptable actions (e.g.
> worker\service
> > > >
> > > > shutdown)
> > > > > > > > > > > > - Long I/O operations (should be a configurable
> timeout
> > > >
> > > > for each
> > > > > > > > > type of
> > > > > > > > > > > > usage)
> > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked
> > > >
> > > > threads,
> > > > > > > > > exclude
> > > > > > > > > > > I/O)
> > > > > > > > > > > >
> > > > > > > > > > > > Second,
> > > > > > > > > > > > - The working queue is without progress (e.g. disco,
> > > >
> > > > exchange
> > > > > > > > > queues)
> > > > > > > > > > > > - Work hasn't been completed since the last
> heartbeat
> > > >
> > > > (checking
> > > > > > > > > > > > milestones)
> > > > > > > > > > > > - Too many system resources used by a thread for the
> > >
> > > long
> > > > period
> > > > > > > > of
> > > > > > > > > time
> > > > > > > > > > > > (allocated memory, CPU)
> > > > > > > > > > > > - Timing fields associated with each thread status
> > > >
> > > > exceeded a
> > > > > > > > > maximum
> > > > > > > > > > > time
> > > > > > > > > > > > limit.
> > > > > > > > > > > >
> > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > - `log everything` should be the default behaviour
> in
> > >
> > > all
> > > > these
> > > > > > > > > cases,
> > > > > > > > > > > > since it may be difficult to find the cause after the
> > > >
> > > > restart.
> > > > > > > > > > > > - Wait some interval of time and kill the hanging
> node
> > > >
> > > > (cluster
> > > > > > > > > should
> > > > > > > > > > > be
> > > > > > > > > > > > configured stable enough)
> > > > > > > > > > > >
> > > > > > > > > > > > Questions,
> > > > > > > > > > > > - Not sure, but can workers miss their heartbeat
> > > >
> > > > deadlines if CPU
> > > > > > > > > loads
> > > > > > > > > > > up
> > > > > > > > > > > > to 80%-90%? Bursts of momentary overloads can be
> > > > > > > > > > > > expected behaviour as a normal part of system
> > > >
> > > > operations.
> > > > > > > > > > > > - Why do we decide that critical thread should
> monitor
> > > >
> > > > each other?
> > > > > > > > > For
> > > > > > > > > > > > instance, if all the tasks were blocked and unable to
> > >
> > > run,
> > > > > > > > > > > > node reset would never occur. As for me, a better
> > > >
> > > > solution is
> > > > > > > > to
> > > > > > > > > use
> > > > > > > > > > > a
> > > > > > > > > > > > separate monitor thread or pool (maybe both with
> software
> > > > > > > > > > > > and hardware checks) that not only checks
> heartbeats
> > > >
> > > > but
> > > > > > > > > monitors the
> > > > > > > > > > > > other system as well.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <
> > > >
> > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > It would be safer to restart the entire cluster
> than to
> > > >
> > > > remove
> > > > > > > > the
> > > > > > > > > last
> > > > > > > > > > > > > node for a cache that should be redundant.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <
> > > >
> > > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I agree with Yakov that we can provide some
> option
> > > >
> > > > that manage
> > > > > > > > > worker
> > > > > > > > > > > > > > liveness checker behavior in case of observing
> that
> > > >
> > > > some worker
> > > > > > > > > is
> > > > > > > > > > > > > > blocked too long.
> > > > > > > > > > > > > > At least it will some workaround for cases when
> node
> > > >
> > > > fails is
> > > > > > > > > too
> > > > > > > > > > > > > > annoying.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Backups count threshold sounds good but I don't
> > > >
> > > > understand how
> > > > > > > > it
> > > > > > > > > > > will
> > > > > > > > > > > > > > help in case of cluster hanging.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The simplest solution here is alert in cases of
> > > >
> > > > blocking of
> > > > > > > > some
> > > > > > > > > > > > > > critical worker (we can improve WorkersRegistry
> for
> > > >
> > > > this
> > > > > > > > purpose
> > > > > > > > > and
> > > > > > > > > > > > > > expose list of blocked workers) and optionally
> call
> > > >
> > > > system
> > > > > > > > > configured
> > > > > > > > > > > > > > failure processor. BTW, failure processor can be
> > > >
> > > > extended in
> > > > > > > > > order to
> > > > > > > > > > > > > > perform any checks (e.g. backup count) and decide
> > > >
> > > > whether it
> > > > > > > > > should
> > > > > > > > > > > > > > stop node or not.
> > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <
> > > > > > > > >
> > > > > > > > > [hidden email]>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > David, Yakov, I understand your fears. But
> liveness
> > > >
> > > > checks
> > > > > > > > deal
> > > > > > > > > > > with
> > > > > > > > > > > > > > > _critical_ conditions, i.e. when such a
> condition
> > >
> > > is
> > > > met we
> > > > > > > > > > > conclude
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > node as totally broken, and there is no sense
> to
> > > >
> > > > keep it
> > > > > > > > alive
> > > > > > > > > > > > > regardless
> > > > > > > > > > > > > > > the data it contains. If we want to give it a
> > > >
> > > > chance, then
> > > > > > > > the
> > > > > > > > > > > > > condition
> > > > > > > > > > > > > > > (long fsync etc.) should not considered as
> critical
> > > >
> > > > at all.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <
> > > > > > > > >
> > > > > > > > > [hidden email]>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Agree with David. We need to have an
> opporunity
> > > >
> > > > set backups
> > > > > > > > > count
> > > > > > > > > > > > > > threshold
> > > > > > > > > > > > > > > > (at runtime also!) that will not allow any
> > > >
> > > > automatic stop
> > > > > > > > if
> > > > > > > > > > > there
> > > > > > > > > > > > > > will be
> > > > > > > > > > > > > > > > a data loss. Andrey, what do you think?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > --
> > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > > Andrey Kuznetsov.
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Andrey Kuznetsov.
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards, Vyacheslav D.
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards, Vyacheslav D.
> > > >
> > >
> > > --
> > > --
> > > Maxim Muzafarov
> > >
> >
> >
>
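
For illustration, the per-failure-type suppression discussed in the quoted thread (a handler with ignoreFailureTypes = [CRITICAL_WORKER_BLOCKED]) could look roughly like the sketch below. All class and enum names here are hypothetical stand-ins chosen for this example; this is not Ignite's actual failure-handling API.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical failure categories; only the second name appears in the thread.
enum FailureKind { CRITICAL_ERROR, CRITICAL_WORKER_BLOCKED }

// Hypothetical handler that suppresses a configurable set of failure kinds.
final class FilteringFailureHandlerSketch {
    private final Set<FailureKind> ignored;

    FilteringFailureHandlerSketch(Set<FailureKind> ignored) {
        // Defensive copy; EnumSet.copyOf rejects an empty non-EnumSet collection,
        // so the empty case is handled separately.
        this.ignored = ignored.isEmpty()
            ? EnumSet.noneOf(FailureKind.class)
            : EnumSet.copyOf(ignored);
    }

    /** Returns true if the node should be stopped, false if the failure is only logged. */
    boolean onFailure(FailureKind kind) {
        if (ignored.contains(kind)) {
            System.out.println("Ignored failure (logged only): " + kind);
            return false;
        }

        System.out.println("Stopping node on failure: " + kind);
        return true;
    }
}
```

With CRITICAL_WORKER_BLOCKED in the ignored set, a blocked worker is still reported but does not take the node down, which is the behavior suggested above for tests that would otherwise fail.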

Nikolay Izhikov-2

Re: Critical worker threads liveness checking drawbacks

Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737

Fixed version is 2.7.

On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:

> Nikolay, I agree, a user should be able to disable both thread liveness
> check and checkpoint read lock timeout check from config and a system
> property.
>
> Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[hidden email]>:
>
> > Hello, Igniters.
> >
> > I found that this feature can't be disabled from config.
> > The only way to disable it is via a JMX bean.
> >
> > I think this is very dangerous: if we have some corner case or a bug in this
> > watchdog, it can make Ignite unusable.
> > I propose to make it possible to disable this feature both from
> > config and from JVM options.
> >
> > What do you think?
> >
> > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > > Maxim,
> > >
> > > Thanks for being attentive! It's definitely a typo. Could you please create
> > > an issue?
> > >
> > > Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[hidden email]>:
> > >
> > > > Folks,
> > > >
> > > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
> > > > exchange future wrapped with double `blockingSectionEnd` method. Is it
> > > > correct? I just want to understand this change and how should I use this
> > > > in the future.
> > > >
> > > > Should I file a new issue to fix this? I think here `blockingSectionBegin`
> > > > method should be used.
> > > >
> > > > -------------
> > > > blockingSectionEnd();
> > > >
> > > > try {
> > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > } finally {
> > > > blockingSectionEnd();
> > > > }
> > > >
> > > >
> > > > [1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > >
> > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]>
> > > > wrote:
> > > >
> > > > > Andrey Gura, thank you for the answer!
> > > > >
> > > > > I agree that wrapping of 'init' method reduces the profit of watchdog
> > > > > service in case of PME worker, but in other cases, we should wrap all
> > > > > possible long sections on GridDhtPartitionExchangeFuture. For example
> > > > > 'onCacheChangeRequest' method or
> > > > > 'cctx.affinity().onCacheChangeRequest' inside because it may take
> > > > > significant time (reproducer attached).
> > > > >
> > > > > I only want to point out a possible issue which may allow to end-user
> > > > > halt the Ignite cluster accidentally.
> > > > >
> > > > > I'm sure that PME experts know how to fix this issue properly.
> > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]>
> >
> > wrote:
> > > > > >
> > > > > > Vyacheslav,
> > > > > >
> > > > > > > Exchange worker is strongly tied to
> > > > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker also
> > > > > > > shouldn't be blocked for a long time, but in reality it happens. It also
> > > > > > > means that your change doesn't make sense.
> > > > > > >
> > > > > > > What actually makes sense is identifying the places which block
> > > > > > > intentionally. Maybe some places/actions should be braced by
> > > > > > > blocking guards.
> > > > > > >
> > > > > > > If you have failing tests, please make sure that your failureHandler is
> > > > > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > > >
> > > > [hidden email]>
> > > > > wrote:
> > > > > > >
> > > > > > > Hi Igniters!
> > > > > > >
> > > > > > > Thank you for this important improvement!
> > > > > > >
> > > > > > > I've looked through implementation and noticed that
> > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in
> >
> > blocked
> > > > > > > section. This means it easy to halt the node in case of
> >
> > longrunning
> > > > > > > actions during PME, for example when we create a cache with
> > > > > > > StoreFactrory which connect to 3rd party DB.
> > > > > > >
> > > > > > > I'm not sure that it is the right behavior.
> > > > > > >
> > > > > > > I filled the issue [1] and prepared the PR [2] with reproducer
> >
> > and
> > > > >
> > > > > possible fix.
> > > > > > >
> > > > > > > Andrey, could you please look at and confirm that it makes sense?
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <
> >
> > [hidden email]>
> > > > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > Denis,
> > > > > > > >
> > > > > > > > I've created the ticket [1] with short description of the
> > > > >
> > > > > functionality.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > > >
> > > > > > > >
> > > > > > > > > Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]>:
> > > > > > > >
> > > > > > > > > Andrey K. and G.,
> > > > > > > > >
> > > > > > > > > Thanks, do we have a documentation ticket created? Prachi
> > > >
> > > > (copied)
> > > > > can help
> > > > > > > > > with the documentation.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Denis
> > > > > > > > >
> > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <
> >
> > [hidden email]>
> > > > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Andrey,
> > > > > > > > > >
> > > > > > > > > > finally your change is merged to master branch.
> >
> > Congratulations
> > > > >
> > > > > and
> > > > > > > > > > thank you very much! :)
> > > > > > > > > >
> > > > > > > > > > I think that the next step is feature that will allow
> >
> > signal
> > > > >
> > > > > about
> > > > > > > > > > blocked threads to the monitoring tools via MXBean.
> > > > > > > > > >
> > > > > > > > > > I hope you will continue development of this feature and
> > > >
> > > > provide
> > > > > your
> > > > > > > > > > vision in new JIRA issue.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <
> > > > >
> > > > > [hidden email]>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > David, Maxim!
> > > > > > > > > > >
> > > > > > > > > > > Thanks a lot for you ideas. Unfortunately, I can't adopt
> >
> > all
> > > > >
> > > > > of them
> > > > > > > > > > right
> > > > > > > > > > > now: the scope is much broader than the scope of the
> >
> > change I
> > > > > > > > >
> > > > > > > > > implement.
> > > > > > > > > > I
> > > > > > > > > > > have had a talk to a group of Ignite commiters, and we
> >
> > agreed
> > > > >
> > > > > to
> > > > > > > > > complete
> > > > > > > > > > > the change as follows.
> > > > > > > > > > > - Blocking instructions in system-critical which may
> > > >
> > > > resonably
> > > > > last
> > > > > > > > > long
> > > > > > > > > > > should be explicitly excluded from the monitoring.
> > > > > > > > > > > - Failure handlers should have a setting to suppress some
> > > > >
> > > > > failures on
> > > > > > > > > > > per-failure-type basis.
> > > > > > > > > > > According to this I have updated the implementation: [1]
> > > > > > > > > > >
> > > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > > >
> > > > > > > > > > > > Mon, 10 Sep 2018 at 22:35, David Harvey <
> > > > >
> > > > > [hidden email]>:
> > > > > > > > > > >
> > > > > > > > > > > > When I've done this before,I've needed to find the
> >
> > oldest
> > > > >
> > > > > thread,
> > > > > > > > > and
> > > > > > > > > > kill
> > > > > > > > > > > > the node running that. From a language standpoint,
> > > >
> > > > Maxim's
> > > > > "without
> > > > > > > > > > > > progress" better than "heartbeat". For example, what
> >
> > I'm
> > > > >
> > > > > most
> > > > > > > > > > interested
> > > > > > > > > > > > in on a distributed system is which thread started the
> >
> > work
> > > > >
> > > > > it has
> > > > > > > > > not
> > > > > > > > > > > > completed the earliest, and when did that thread last
> >
> > make
> > > > >
> > > > > forward
> > > > > > > > > > > > process. You don't want to kill a node because a
> >
> > thread
> > > > >
> > > > > is
> > > > > > > > > waiting
> > > > > > > > > > on a
> > > > > > > > > > > > lock held by a thread that went off-node and has not
> > > >
> > > > gotten a
> > > > > > > > > response.
> > > > > > > > > > > > If you don't understand the dependency relationships,
> >
> > you
> > > > >
> > > > > will make
> > > > > > > > > > > > incorrect recovery decisions.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <
> > > > >
> > > > > [hidden email]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I think we should find exact answers to these
> >
> > questions:
> > > > > > > > > > > > > 1. What `critical` issue exactly is?
> > > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > > >
> > > > > > > > > > > > > First,
> > > > > > > > > > > > > - Ignore uninterruptable actions (e.g.
> >
> > worker\service
> > > > >
> > > > > shutdown)
> > > > > > > > > > > > > - Long I/O operations (should be a configurable
> >
> > timeout
> > > > >
> > > > > for each
> > > > > > > > > > type of
> > > > > > > > > > > > > usage)
> > > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked
> > > > >
> > > > > threads,
> > > > > > > > > > exclude
> > > > > > > > > > > > I/O)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Second,
> > > > > > > > > > > > > - The working queue is without progress (e.g. disco,
> > > > >
> > > > > exchange
> > > > > > > > > > queues)
> > > > > > > > > > > > > - Work hasn't been completed since the last
> >
> > heartbeat
> > > > >
> > > > > (checking
> > > > > > > > > > > > > milestones)
> > > > > > > > > > > > > - Too many system resources used by a thread for the
> > > >
> > > > long
> > > > > period
> > > > > > > > > of
> > > > > > > > > > time
> > > > > > > > > > > > > (allocated memory, CPU)
> > > > > > > > > > > > > - Timing fields associated with each thread status
> > > > >
> > > > > exceeded a
> > > > > > > > > > maximum
> > > > > > > > > > > > time
> > > > > > > > > > > > > limit.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > > - `log everything` should be the default behaviour
> >
> > in
> > > >
> > > > all
> > > > > these
> > > > > > > > > > cases,
> > > > > > > > > > > > > since it may be difficult to find the cause after the
> > > > >
> > > > > restart.
> > > > > > > > > > > > > - Wait some interval of time and kill the hanging
> >
> > node
> > > > >
> > > > > (cluster
> > > > > > > > > > should
> > > > > > > > > > > > be
> > > > > > > > > > > > > configured stable enough)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Questions,
> > > > > > > > > > > > > - Not sure, but can workers miss their heartbeat
> > > > >
> > > > > deadlines if CPU
> > > > > > > > > > loads
> > > > > > > > > > > > up
> > > > > > > > > > > > > to 80%-90%? Bursts of momentary overloads can be
> > > > > > > > > > > > > expected behaviour as a normal part of system
> > > > >
> > > > > operations.
> > > > > > > > > > > > > - Why do we decide that critical thread should
> >
> > monitor
> > > > >
> > > > > each other?
> > > > > > > > > > For
> > > > > > > > > > > > > instance, if all the tasks were blocked and unable to
> > > >
> > > > run,
> > > > > > > > > > > > > node reset would never occur. As for me, a better
> > > > >
> > > > > solution is
> > > > > > > > > to
> > > > > > > > > > use
> > > > > > > > > > > > a
> > > > > > > > > > > > > separate monitor thread or pool (maybe both with
> >
> > software
> > > > > > > > > > > > > and hardware checks) that not only checks
> >
> > heartbeats
> > > > >
> > > > > but
> > > > > > > > > > monitors the
> > > > > > > > > > > > > other system as well.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <
> > > > >
> > > > > [hidden email]>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > It would be safer to restart the entire cluster
> >
> > than to
> > > > >
> > > > > remove
> > > > > > > > > the
> > > > > > > > > > last
> > > > > > > > > > > > > > node for a cache that should be redundant.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <
> > > > >
> > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I agree with Yakov that we can provide some
> >
> > option
> > > > >
> > > > > that manage
> > > > > > > > > > worker
> > > > > > > > > > > > > > > liveness checker behavior in case of observing
> >
> > that
> > > > >
> > > > > some worker
> > > > > > > > > > is
> > > > > > > > > > > > > > > blocked too long.
> > > > > > > > > > > > > > > At least it will some workaround for cases when
> >
> > node
> > > > >
> > > > > fails is
> > > > > > > > > > too
> > > > > > > > > > > > > > > annoying.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Backups count threshold sounds good but I don't
> > > > >
> > > > > understand how
> > > > > > > > > it
> > > > > > > > > > > > will
> > > > > > > > > > > > > > > help in case of cluster hanging.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The simplest solution here is alert in cases of
> > > > >
> > > > > blocking of
> > > > > > > > > some
> > > > > > > > > > > > > > > critical worker (we can improve WorkersRegistry
> >
> > for
> > > > >
> > > > > this
> > > > > > > > > purpose
> > > > > > > > > > and
> > > > > > > > > > > > > > > expose list of blocked workers) and optionally
> >
> > call
> > > > >
> > > > > system
> > > > > > > > > > configured
> > > > > > > > > > > > > > > failure processor. BTW, failure processor can be
> > > > >
> > > > > extended in
> > > > > > > > > > order to
> > > > > > > > > > > > > > > perform any checks (e.g. backup count) and decide
> > > > >
> > > > > whether it
> > > > > > > > > > should
> > > > > > > > > > > > > > > stop node or not.
> > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <
> > > > > > > > > >
> > > > > > > > > > [hidden email]>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But
> >
> > liveness
> > > > >
> > > > > checks
> > > > > > > > > deal
> > > > > > > > > > > > with
> > > > > > > > > > > > > > > > _critical_ conditions, i.e. when such a
> >
> > condition
> > > >
> > > > is
> > > > > met we
> > > > > > > > > > > > conclude
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > node as totally broken, and there is no sense
> >
> > to
> > > > >
> > > > > keep it
> > > > > > > > > alive
> > > > > > > > > > > > > > regardless
> > > > > > > > > > > > > > > > the data it contains. If we want to give it a
> > > > >
> > > > > chance, then
> > > > > > > > > the
> > > > > > > > > > > > > > condition
> > > > > > > > > > > > > > > > (long fsync etc.) should not considered as
> >
> > critical
> > > > >
> > > > > at all.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <
> > > > > > > > > >
> > > > > > > > > > [hidden email]>:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Agree with David. We need to have an
> >
> > opporunity
> > > > >
> > > > > set backups
> > > > > > > > > > count
> > > > > > > > > > > > > > > threshold
> > > > > > > > > > > > > > > > > (at runtime also!) that will not allow any
> > > > >
> > > > > automatic stop
> > > > > > > > > if
> > > > > > > > > > > > there
> > > > > > > > > > > > > > > will be
> > > > > > > > > > > > > > > > > a data loss. Andrey, what do you think?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Andrey Kuznetsov.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best Regards, Vyacheslav D.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards, Vyacheslav D.
> > > > >
> > > >
> > > > --
> > > > --
> > > > Maxim Muzafarov
> > > >
> > >
> > >
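
The blocking-section pattern quoted above (begin before the long call, end in the finally block) can be sketched as follows. This is a minimal model under stated assumptions: the class, the injected clock, and the sentinel heartbeat value are hypothetical and only mirror the method names mentioned in the thread; it is not the code from GridCachePartitionExchangeManager.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a worker monitored by a liveness watchdog.
final class WatchedWorkerSketch {
    /** Sentinel heartbeat meaning "inside a blocking section, skip the check". */
    private static final long BLOCKED = Long.MAX_VALUE;

    private final LongSupplier clock;
    private volatile long heartbeatTs;

    WatchedWorkerSketch(LongSupplier clock) {
        this.clock = clock;
        this.heartbeatTs = clock.getAsLong();
    }

    /** Records progress; meant to be called from the worker loop. */
    void updateHeartbeat() { heartbeatTs = clock.getAsLong(); }

    /** Marks the start of an intentionally long blocking call. */
    void blockingSectionBegin() { heartbeatTs = BLOCKED; }

    /** Marks the end of the blocking call and resumes monitoring. */
    void blockingSectionEnd() { updateHeartbeat(); }

    /** Wraps a call so that begin and end are always paired. */
    <T> T callBlocking(Callable<T> call) throws Exception {
        blockingSectionBegin();
        try {
            return call.call();
        } finally {
            blockingSectionEnd();
        }
    }

    /** Watchdog check: true if no progress was recorded within the timeout. */
    boolean isBlocked(long timeoutMs) {
        long ts = heartbeatTs;
        return ts != BLOCKED && clock.getAsLong() - ts > timeoutMs;
    }
}
```

Routing every long wait through a helper like callBlocking makes the begin/end pairing hard to get wrong, which is the kind of mistake the double `blockingSectionEnd` report describes.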


agura

Re: Critical worker threads liveness checking drawbacks

Guys,

Why do we need both a config option and a system property? I believe one way is enough.
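
To make the two options concrete, here is a minimal sketch of how a config value and a system property could coexist, assuming the system property takes precedence and a non-positive timeout disables the check. The property name, the precedence rule, and the "non-positive means disabled" convention are assumptions for illustration only, not Ignite's actual behavior.

```java
// Hypothetical resolution of a watchdog timeout from two sources.
final class WatchdogTimeoutSketch {
    /** Hypothetical system property name, chosen for this example. */
    static final String PROP = "example.worker.blocked.timeout";

    /**
     * Resolves the effective timeout in milliseconds.
     * A system property, when set and parseable, overrides the configured value;
     * otherwise the configured value is used, or the default if none is configured.
     */
    static long resolve(Long configuredMs, long defaultMs) {
        String raw = System.getProperty(PROP);

        if (raw != null) {
            try {
                return Long.parseLong(raw.trim());
            } catch (NumberFormatException ignored) {
                // Malformed property: fall through to the configured value.
            }
        }

        return configuredMs != null ? configuredMs : defaultMs;
    }

    /** A non-positive timeout is treated as "liveness check disabled". */
    static boolean enabled(long timeoutMs) {
        return timeoutMs > 0;
    }
}
```

In this model the config option is the normal way to set the value, while the system property serves as an operator override that needs no configuration change.
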
On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[hidden email]> wrote:

>
> Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
>
> Fixed version is 2.7.
>
> On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
> > Nikolay, I agree, a user should be able to disable both thread liveness
> > check and checkpoint read lock timeout check from config and a system
> > property.
> >
> > Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[hidden email]>:
> >
> > > Hello, Igniters.
> > >
> > > I found that this feature can't be disabled from config.
> > > The only way to disable it is via a JMX bean.
> > >
> > > I think this is very dangerous: if we have some corner case or a bug in this
> > > watchdog, it can make Ignite unusable.
> > > I propose to make it possible to disable this feature both from
> > > config and from JVM options.
> > >
> > > What do you think?
> > >
> > > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > > > Maxim,
> > > >
> > > > Thanks for being attentive! It's definitely a typo. Could you please create
> > > > an issue?
> > > >
> > > > Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[hidden email]>:
> > > >
> > > > > Folks,
> > > > >
> > > > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
> > > > > exchange future wrapped with double `blockingSectionEnd` method. Is it
> > > > > correct? I just want to understand this change and how should I use this
> > > > > in the future.
> > > > >
> > > > > Should I file a new issue to fix this? I think here `blockingSectionBegin`
> > > > > method should be used.
> > > > >
> > > > > -------------
> > > > > blockingSectionEnd();
> > > > >
> > > > > try {
> > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > } finally {
> > > > > blockingSectionEnd();
> > > > > }
> > > > >
> > > > >
> > > > > [1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > >
> > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Andrey Gura, thank you for the answer!
> > > > > >
> > > > > > I agree that wrapping the 'init' method reduces the benefit of the
> > > > > > watchdog service in case of the PME worker, but in other cases we
> > > > > > should wrap all possibly long sections in
> > > > > > GridDhtPartitionsExchangeFuture, for example the
> > > > > > 'onCacheChangeRequest' method or
> > > > > > 'cctx.affinity().onCacheChangeRequest' inside it, because it may take
> > > > > > significant time (reproducer attached).
> > > > > >
> > > > > > I only want to point out a possible issue which may allow an end user
> > > > > > to halt the Ignite cluster accidentally.
> > > > > >
> > > > > > I'm sure that PME experts know how to fix this issue properly.
> > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]> wrote:
> > > > > > >
> > > > > > > Vyacheslav,
> > > > > > >
> > > > > > > Exchange worker is strongly tied with
> > > > > > > GridDhtPartitionsExchangeFuture#init and it is ok. Exchange worker
> > > > > > > also shouldn't be blocked for a long time, but in reality it
> > > > > > > happens. It also means that your change doesn't make sense.
> > > > > > >
> > > > > > > What actually makes sense is identifying the places that block
> > > > > > > intentionally. Maybe some places/actions should be braced by
> > > > > > > blocking guards.
> > > > > > >
> > > > > > > If you have failing tests, please make sure that your failureHandler
> > > > > > > is NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <[hidden email]> wrote:
> > > > > > > >
> > > > > > > > Hi Igniters!
> > > > > > > >
> > > > > > > > Thank you for this important improvement!
> > > > > > > >
> > > > > > > > I've looked through the implementation and noticed that
> > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> > > > > > > > blocking section. This means it is easy to halt the node in case of
> > > > > > > > long-running actions during PME, for example when we create a cache
> > > > > > > > with a StoreFactory which connects to a 3rd party DB.
> > > > > > > >
> > > > > > > > I'm not sure that it is the right behavior.
> > > > > > > >
> > > > > > > > I filed the issue [1] and prepared the PR [2] with a reproducer and
> > > > > > > > a possible fix.
> > > > > > > >
> > > > > > > > Andrey, could you please take a look and confirm that it makes sense?
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > Denis,
> > > > > > > > >
> > > > > > > > > I've created the ticket [1] with a short description of the
> > > > > > > > > functionality.
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > > Andrey K. and G.,
> > > > > > > > > >
> > > > > > > > > > Thanks, do we have a documentation ticket created? Prachi (copied)
> > > > > > > > > > can help with the documentation.
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Denis
> > > > > > > > > >
> > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Andrey,
> > > > > > > > > > >
> > > > > > > > > > > finally your change is merged to the master branch.
> > > > > > > > > > > Congratulations and thank you very much! :)
> > > > > > > > > > >
> > > > > > > > > > > I think that the next step is a feature that will allow
> > > > > > > > > > > signaling about blocked threads to the monitoring tools via
> > > > > > > > > > > MXBean.
> > > > > > > > > > >
> > > > > > > > > > > I hope you will continue development of this feature and provide
> > > > > > > > > > > your vision in a new JIRA issue.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > David, Maxim!
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt all
> > > > > > > > > > > > of them right now: the scope is much broader than the scope of
> > > > > > > > > > > > the change I implement. I have had a talk with a group of
> > > > > > > > > > > > Ignite committers, and we agreed to complete the change as
> > > > > > > > > > > > follows.
> > > > > > > > > > > > - Blocking instructions in system-critical threads which may
> > > > > > > > > > > > reasonably last long should be explicitly excluded from the
> > > > > > > > > > > > monitoring.
> > > > > > > > > > > > - Failure handlers should have a setting to suppress some
> > > > > > > > > > > > failures on a per-failure-type basis.
> > > > > > > > > > > > According to this I have updated the implementation: [1]
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > When I've done this before, I've needed to find the oldest
> > > > > > > > > > > > > thread and kill the node running that. From a language
> > > > > > > > > > > > > standpoint, Maxim's "without progress" is better than
> > > > > > > > > > > > > "heartbeat". For example, what I'm most interested in on a
> > > > > > > > > > > > > distributed system is which thread started the work it has
> > > > > > > > > > > > > not completed the earliest, and when that thread last made
> > > > > > > > > > > > > forward progress. You don't want to kill a node because a
> > > > > > > > > > > > > thread is waiting on a lock held by a thread that went
> > > > > > > > > > > > > off-node and has not gotten a response. If you don't
> > > > > > > > > > > > > understand the dependency relationships, you will make
> > > > > > > > > > > > > incorrect recovery decisions.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <[hidden email]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > First,
> > > > > > > > > > > > > > - Ignore uninterruptible actions (e.g. worker\service
> > > > > > > > > > > > > > shutdown)
> > > > > > > > > > > > > > - Long I/O operations (should be a configurable timeout for
> > > > > > > > > > > > > > each type of usage)
> > > > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked
> > > > > > > > > > > > > > threads, exclude I/O)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Second,
> > > > > > > > > > > > > > - The working queue is without progress (e.g. disco,
> > > > > > > > > > > > > > exchange queues)
> > > > > > > > > > > > > > - Work hasn't been completed since the last heartbeat
> > > > > > > > > > > > > > (checking milestones)
> > > > > > > > > > > > > > - Too many system resources used by a thread for a long
> > > > > > > > > > > > > > period of time (allocated memory, CPU)
> > > > > > > > > > > > > > - Timing fields associated with each thread status exceeded
> > > > > > > > > > > > > > a maximum time limit.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > > > - `log everything` should be the default behaviour in all
> > > > > > > > > > > > > > these cases, since it may be difficult to find the cause
> > > > > > > > > > > > > > after the restart.
> > > > > > > > > > > > > > - Wait some interval of time and kill the hanging node
> > > > > > > > > > > > > > (cluster should be configured stable enough)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Questions,
> > > > > > > > > > > > > > - Not sure, but can workers miss their heartbeat deadlines
> > > > > > > > > > > > > > if CPU loads up to 80%-90%? Bursts of momentary overloads
> > > > > > > > > > > > > > can be expected behaviour as a normal part of system
> > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > - Why do we decide that critical threads should monitor
> > > > > > > > > > > > > > each other? For instance, if all the tasks were blocked and
> > > > > > > > > > > > > > unable to run, node reset would never occur. As for me, a
> > > > > > > > > > > > > > better solution is to use a separate monitor thread or pool
> > > > > > > > > > > > > > (maybe both with software and hardware checks) that not
> > > > > > > > > > > > > > only checks heartbeats but monitors the other system as
> > > > > > > > > > > > > > well.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <[hidden email]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It would be safer to restart the entire cluster than to
> > > > > > > > > > > > > > > remove the last node for a cache that should be redundant.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I agree with Yakov that we can provide some option that
> > > > > > > > > > > > > > > > manages the worker liveness checker behavior in case of
> > > > > > > > > > > > > > > > observing that some worker is blocked too long.
> > > > > > > > > > > > > > > > At least it will be some workaround for cases when node
> > > > > > > > > > > > > > > > failure is too annoying.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Backups count threshold sounds good but I don't
> > > > > > > > > > > > > > > > understand how it will help in case of cluster hanging.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The simplest solution here is an alert in case of
> > > > > > > > > > > > > > > > blocking of some critical worker (we can improve
> > > > > > > > > > > > > > > > WorkersRegistry for this purpose and expose the list of
> > > > > > > > > > > > > > > > blocked workers) and optionally calling the
> > > > > > > > > > > > > > > > system-configured failure processor. BTW, the failure
> > > > > > > > > > > > > > > > processor can be extended in order to perform any checks
> > > > > > > > > > > > > > > > (e.g. backup count) and decide whether it should stop the
> > > > > > > > > > > > > > > > node or not.
> > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <[hidden email]> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness
> > > > > > > > > > > > > > > > > checks deal with _critical_ conditions, i.e. when such a
> > > > > > > > > > > > > > > > > condition is met we consider the node totally broken, and
> > > > > > > > > > > > > > > > > there is no sense in keeping it alive regardless of the
> > > > > > > > > > > > > > > > > data it contains. If we want to give it a chance, then
> > > > > > > > > > > > > > > > > the condition (long fsync etc.) should not be considered
> > > > > > > > > > > > > > > > > critical at all.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <[hidden email]> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a
> > > > > > > > > > > > > > > > > > backups count threshold (at runtime also!) that will not
> > > > > > > > > > > > > > > > > > allow any automatic stop if it would lead to data loss.
> > > > > > > > > > > > > > > > > > Andrey, what do you think?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > > Andrey Kuznetsov.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best Regards, Vyacheslav D.
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards, Vyacheslav D.
> > > > > >
> > > > >
> > > > > --
> > > > > --
> > > > > Maxim Muzafarov
> > > > >
> > > >
> > > >
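For readers following the `blockingSectionBegin`/`blockingSectionEnd` discussion quoted above, here is a minimal, self-contained sketch of the guard pattern: the worker marks itself as intentionally blocked before a long wait and restores its heartbeat in `finally`. Only the two method names come from the thread; the `GuardedWorker` class, its sentinel value, `looksHung` and `callBlocking` are hypothetical names invented for illustration and are not claimed to match Ignite's worker code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical worker whose heartbeat a watchdog thread could inspect. */
final class GuardedWorker {
    /** Sentinel (assumption): the worker is inside a blocking section and must not be reported. */
    private static final long IN_BLOCKING_SECTION = Long.MAX_VALUE;

    /** Timestamp (System.nanoTime) of the last sign of progress. */
    private final AtomicLong heartbeatNanos = new AtomicLong(System.nanoTime());

    /** Records progress; a real worker would call this from its main loop. */
    void updateHeartbeat() {
        heartbeatNanos.set(System.nanoTime());
    }

    /** Opens a section that is expected to block for a long time. */
    void blockingSectionBegin() {
        heartbeatNanos.set(IN_BLOCKING_SECTION);
    }

    /** Closes the section and resumes normal liveness tracking. */
    void blockingSectionEnd() {
        updateHeartbeat();
    }

    /** Watchdog-side check: true if no progress was seen for longer than the timeout. */
    boolean looksHung(long nowNanos, long timeoutNanos) {
        long hb = heartbeatNanos.get();

        return hb != IN_BLOCKING_SECTION && nowNanos - hb > timeoutNanos;
    }

    /** Runs a potentially long operation inside a correctly paired guard. */
    <T> T callBlocking(Supplier<T> op) {
        blockingSectionBegin();

        try {
            return op.get();
        }
        finally {
            blockingSectionEnd();
        }
    }
}
```

In this sketch, calling `blockingSectionEnd()` before the wait (as in the quoted snippet) would only refresh the heartbeat, so a long `exchFut.get(...)` would still be measured against the timeout; opening the section with `blockingSectionBegin()` is what excludes the wait from the check.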

Vladimir Ozerov

Re: Critical worker threads liveness checking drawbacks

Then it should be a config option.
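To make the "single config option" idea concrete, here is a rough sketch of one configuration object that carries both the on/off switch for the liveness check and the per-failure-type suppression mentioned elsewhere in this thread. Every name in it (`LivenessConfig`, `LivenessPolicy`, `FailureKind`, the setters) is hypothetical and chosen only for illustration; the only identifier taken from the discussion is `CRITICAL_WORKER_BLOCKED`. It is not the actual Ignite configuration API.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical failure categories; only CRITICAL_WORKER_BLOCKED is named in this thread. */
enum FailureKind { CRITICAL_ERROR, CRITICAL_WORKER_BLOCKED }

/** Hypothetical configuration bean holding every liveness-related switch in one place. */
final class LivenessConfig {
    /** Whether the worker liveness check is active at all. */
    private boolean checkEnabled = true;

    /** Failure types that should be logged but never escalated. */
    private final Set<FailureKind> ignored = EnumSet.noneOf(FailureKind.class);

    LivenessConfig setCheckEnabled(boolean enabled) {
        checkEnabled = enabled;

        return this;
    }

    LivenessConfig setIgnoredFailureTypes(Set<FailureKind> kinds) {
        ignored.clear();
        ignored.addAll(kinds);

        return this;
    }

    boolean isCheckEnabled() {
        return checkEnabled;
    }

    boolean isIgnored(FailureKind kind) {
        return ignored.contains(kind);
    }
}

/** Decides whether a detected failure should be escalated (for example, by stopping the node). */
final class LivenessPolicy {
    private LivenessPolicy() {
        // Static helper, not meant to be instantiated.
    }

    static boolean shouldEscalate(LivenessConfig cfg, FailureKind kind) {
        // A disabled liveness check never escalates blocked-worker reports.
        if (kind == FailureKind.CRITICAL_WORKER_BLOCKED && !cfg.isCheckEnabled())
            return false;

        return !cfg.isIgnored(kind);
    }
}
```

With a shape like this, a JVM system property would at most be an alternative way to populate the same object, so the decision logic would not need to know where the setting came from.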

On Fri, 28 Sep 2018 at 13:15, Andrey Gura <[hidden email]> wrote:

> Guys,
>
> why do we need both a config option and a system property? I believe one
> way is enough.
> On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[hidden email]> wrote:
> >
> > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> >
> > > Fix version is 2.7.
> > >
> > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
> > > Nikolay, I agree, a user should be able to disable both thread liveness
> > > check and checkpoint read lock timeout check from config and a system
> > > property.
> > >
> > > > On Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[hidden email]> wrote:
> > > >
> > > > > Hello, Igniters.
> > > > >
> > > > > I found that this feature can't be disabled from config.
> > > > > The only way to disable it is from the JMX bean.
> > > > >
> > > > > I think it is very dangerous: if we have some corner case or a bug in
> > > > > this watchdog, it can make Ignite unusable.
> > > > > I propose to implement the possibility to disable this feature both
> > > > > from config and from JVM options.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > > > > Maxim,
> > > > >
> > > > > > Thanks for being attentive! It's definitely a typo. Could you please
> > > > > > create an issue?
> > > > > >
> > > > > > On Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[hidden email]> wrote:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master
> > > > > > > branch) the exchange future wrapped with a double
> > > > > > > `blockingSectionEnd` call. Is that correct? I just want to
> > > > > > > understand this change and how I should use it in the future.
> > > > > > >
> > > > > > > Should I file a new issue to fix this? I think
> > > > > > > `blockingSectionBegin` should be used here.
> > > > > >
> > > > > > -------------
> > > > > > blockingSectionEnd();
> > > > > >
> > > > > > try {
> > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > > } finally {
> > > > > > blockingSectionEnd();
> > > > > > }
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > >
> > > >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > >
> > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]> wrote:
> > > > > > >
> > > > > > > > Andrey Gura, thank you for the answer!
> > > > > > > >
> > > > > > > > I agree that wrapping the 'init' method reduces the benefit of the
> > > > > > > > watchdog service in case of the PME worker, but in other cases we
> > > > > > > > should wrap all possibly long sections in
> > > > > > > > GridDhtPartitionsExchangeFuture, for example the
> > > > > > > > 'onCacheChangeRequest' method or
> > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside it, because it may
> > > > > > > > take significant time (reproducer attached).
> > > > > > > >
> > > > > > > > I only want to point out a possible issue which may allow an end
> > > > > > > > user to halt the Ignite cluster accidentally.
> > > > > > > >
> > > > > > > > I'm sure that PME experts know how to fix this issue properly.
> > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]> wrote:
> > > > > > > >
> > > > > > > > Vyacheslav,
> > > > > > > >
> > > > > > > > > Exchange worker is strongly tied with
> > > > > > > > > GridDhtPartitionsExchangeFuture#init and it is ok. Exchange worker
> > > > > > > > > also shouldn't be blocked for a long time, but in reality it
> > > > > > > > > happens. It also means that your change doesn't make sense.
> > > > > > > > >
> > > > > > > > > What actually makes sense is identifying the places that block
> > > > > > > > > intentionally. Maybe some places/actions should be braced by
> > > > > > > > > blocking guards.
> > > > > > > > >
> > > > > > > > > If you have failing tests, please make sure that your
> > > > > > > > > failureHandler is NoOpFailureHandler or any other handler with
> > > > > > > > > ignoreFailureTypes = [CRITICAL_WORKER_BLOCKED].
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > Hi Igniters!
> > > > > > > > >
> > > > > > > > > Thank you for this important improvement!
> > > > > > > > >
> > > > > > > > > > I've looked through the implementation and noticed that
> > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> > > > > > > > > > blocking section. This means it is easy to halt the node in case
> > > > > > > > > > of long-running actions during PME, for example when we create a
> > > > > > > > > > cache with a StoreFactory which connects to a 3rd party DB.
> > > > > > > > > >
> > > > > > > > > > I'm not sure that it is the right behavior.
> > > > > > > > > >
> > > > > > > > > > I filed the issue [1] and prepared the PR [2] with a reproducer
> > > > > > > > > > and a possible fix.
> > > > > > > > > >
> > > > > > > > > > Andrey, could you please take a look and confirm that it makes
> > > > > > > > > > sense?
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <
> > > >
> > > > [hidden email]>
> > > > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Denis,
> > > > > > > > > >
> > > > > > > > > > I've created the ticket [1] with short description of the
> > > > > > >
> > > > > > > functionality.
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Andrey K. and G.,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks, do we have a documentation ticket created? Prachi
> > > > > > > > > > > > (copied) can help with the documentation.
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Denis
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Andrey,
> > > > > > > > > > > > >
> > > > > > > > > > > > > finally your change is merged to the master branch.
> > > > > > > > > > > > > Congratulations and thank you very much! :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think that the next step is a feature that will allow
> > > > > > > > > > > > > signaling about blocked threads to the monitoring tools via
> > > > > > > > > > > > > MXBean.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I hope you will continue development of this feature and
> > > > > > > > > > > > > provide your vision in a new JIRA issue.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <[hidden email]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > David, Maxim!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt
> > > > > > > > > > > > > > all of them right now: the scope is much broader than the
> > > > > > > > > > > > > > scope of the change I implement. I have had a talk with a
> > > > > > > > > > > > > > group of Ignite committers, and we agreed to complete the
> > > > > > > > > > > > > > change as follows.
> > > > > > > > > > > > > > - Blocking instructions in system-critical threads which
> > > > > > > > > > > > > > may reasonably last long should be explicitly excluded from
> > > > > > > > > > > > > > the monitoring.
> > > > > > > > > > > > > > - Failure handlers should have a setting to suppress some
> > > > > > > > > > > > > > failures on a per-failure-type basis.
> > > > > > > > > > > > > > According to this I have updated the implementation: [1]
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey <[hidden email]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When I've done this before, I've needed to find the oldest
> > > > > > > > > > > > > > > thread and kill the node running that. From a language
> > > > > > > > > > > > > > > standpoint, Maxim's "without progress" is better than
> > > > > > > > > > > > > > > "heartbeat". For example, what I'm most interested in on a
> > > > > > > > > > > > > > > distributed system is which thread started the work it has
> > > > > > > > > > > > > > > not completed the earliest, and when that thread last made
> > > > > > > > > > > > > > > forward progress. You don't want to kill a node because a
> > > > > > > > > > > > > > > thread is waiting on a lock held by a thread that went
> > > > > > > > > > > > > > > off-node and has not gotten a response. If you don't
> > > > > > > > > > > > > > > understand the dependency relationships, you will make
> > > > > > > > > > > > > > > incorrect recovery decisions.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <[hidden email]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > First,
> > > > > > > > > > > > > > > > - Ignore uninterruptible actions (e.g. worker\service
> > > > > > > > > > > > > > > > shutdown)
> > > > > > > > > > > > > > > > - Long I/O operations (should be a configurable timeout
> > > > > > > > > > > > > > > > for each type of usage)
> > > > > > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked
> > > > > > > > > > > > > > > > threads, exclude I/O)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Second,
> > > > > > > > > > > > > > > > - The working queue is without progress (e.g. disco,
> > > > > > > > > > > > > > > > exchange queues)
> > > > > > > > > > > > > > > > - Work hasn't been completed since the last heartbeat
> > > > > > > > > > > > > > > > (checking milestones)
> > > > > > > > > > > > > > > > - Too many system resources used by a thread for a long
> > > > > > > > > > > > > > > > period of time (allocated memory, CPU)
> > > > > > > > > > > > > > > > - Timing fields associated with each thread status
> > > > > > > > > > > > > > > > exceeded a maximum time limit.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > > > > > - `log everything` should be the default behaviour in all
> > > > > > > > > > > > > > > > these cases, since it may be difficult to find the cause
> > > > > > > > > > > > > > > > after the restart.
> > > > > > > > > > > > > > > > - Wait some interval of time and kill the hanging node
> > > > > > > > > > > > > > > > (cluster should be configured stable enough)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Questions,
> > > > > > > > > > > > > > > > - Not sure, but can workers miss their heartbeat
> > > > > > > > > > > > > > > > deadlines if CPU loads up to 80%-90%? Bursts of momentary
> > > > > > > > > > > > > > > > overloads can be expected behaviour as a normal part of
> > > > > > > > > > > > > > > > system operations.
> > > > > > > > > > > > > > > > - Why do we decide that critical threads should monitor
> > > > > > > > > > > > > > > > each other? For instance, if all the tasks were blocked
> > > > > > > > > > > > > > > > and unable to run, node reset would never occur. As for
> > > > > > > > > > > > > > > > me, a better solution is to use a separate monitor thread
> > > > > > > > > > > > > > > > or pool (maybe both with software and hardware checks)
> > > > > > > > > > > > > > > > that not only checks heartbeats but monitors the other
> > > > > > > > > > > > > > > > system as well.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <[hidden email]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > It would be safer to restart the entire cluster than to
> > > > > > > > > > > > > > > > > remove the last node for a cache that should be
> > > > > > > > > > > > > > > > > redundant.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I agree with Yakov that we can provide some option that
> > > > > > > > > > > > > > > > > > manages the worker liveness checker behavior in case of
> > > > > > > > > > > > > > > > > > observing that some worker is blocked too long.
> > > > > > > > > > > > > > > > > > At least it will be some workaround for cases when node
> > > > > > > > > > > > > > > > > > failure is too annoying.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Backups count threshold sounds good but I don't
> > > > > > > > > > > > > > > > > > understand how it will help in case of cluster hanging.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > The simplest solution here is an alert in case of
> > > > > > > > > > > > > > > > > > blocking of some critical worker (we can improve
> > > > > > > > > > > > > > > > > > WorkersRegistry for this purpose and expose the list of
> > > > > > > > > > > > > > > > > > blocked workers) and optionally calling the
> > > > > > > > > > > > > > > > > > system-configured failure processor. BTW, the failure
> > > > > > > > > > > > > > > > > > processor can be extended in order to perform any checks
> > > > > > > > > > > > > > > > > > (e.g. backup count) and decide whether it should stop
> > > > > > > > > > > > > > > > > > the node or not.
> > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <[hidden email]> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness checks deal with
> > > > > > > > > > > > > > > > > > > _critical_ conditions, i.e. when such a condition is met we consider the
> > > > > > > > > > > > > > > > > > > node totally broken, and there is no sense in keeping it alive regardless
> > > > > > > > > > > > > > > > > > > of the data it contains. If we want to give it a chance, then the
> > > > > > > > > > > > > > > > > > > condition (long fsync etc.) should not be considered critical at all.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <[hidden email]>:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a backups count
> > > > > > > > > > > > > > > > > > > > threshold (at runtime also!) that will not allow any automatic stop if
> > > > > > > > > > > > > > > > > > > > there will be a data loss. Andrey, what do you think?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best Regards, Vyacheslav D.
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > --
> > > > > > Maxim Muzafarov
> > > > > >
> > > > >
> > > > >
>

yzhdanov

Re: Critical worker threads liveness checking drawbacks

Config option + mbean access. Does that make sense?

Yakov
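
For illustration only, here is a minimal Java sketch of the "config option + MBean access" idea: the initial value comes from configuration, and the same value can later be changed at runtime through JMX. Every name in it (`WatchdogSettingsSketch`, `WatchdogSettingsMXBean`, `blockedTimeoutMillis`, the object name passed in) is hypothetical and not part of Ignite's API. It also assumes that a non-positive timeout disables the check, which the thread does not state.

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical sketch; none of these names are Ignite API. */
final class WatchdogSettingsSketch {
    /** Management view of the setting (hypothetical interface). */
    public interface WatchdogSettingsMXBean {
        long getBlockedTimeoutMillis();

        void setBlockedTimeoutMillis(long timeoutMillis);
    }

    /** Holds the value taken from configuration and allows runtime changes. */
    public static final class WatchdogSettings implements WatchdogSettingsMXBean {
        private volatile long blockedTimeoutMillis;

        public WatchdogSettings(long configuredTimeoutMillis) {
            this.blockedTimeoutMillis = configuredTimeoutMillis;
        }

        @Override public long getBlockedTimeoutMillis() {
            return blockedTimeoutMillis;
        }

        @Override public void setBlockedTimeoutMillis(long timeoutMillis) {
            this.blockedTimeoutMillis = timeoutMillis;
        }

        /** Assumption of this sketch: a non-positive timeout disables the check. */
        public boolean checkEnabled() {
            return blockedTimeoutMillis > 0;
        }
    }

    /**
     * Registers the settings on the platform MBean server, sets the timeout to 0
     * through JMX, unregisters the bean, and reports whether the check is still enabled.
     */
    static boolean disableViaJmx(WatchdogSettings settings, String objectName) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name = new ObjectName(objectName);
            server.registerMBean(settings, name);
            try {
                // The attribute name is derived from the getter/setter pair.
                server.setAttribute(name, new Attribute("BlockedTimeoutMillis", 0L));
            } finally {
                server.unregisterMBean(name);
            }
            return settings.checkEnabled();
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this shape a single field backs both entry points, so there is no separate system property to keep in sync with the configured value.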

On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <[hidden email]> wrote:

> Then it should be config option.
>
> Fri, Sep 28, 2018 at 13:15, Andrey Gura <[hidden email]>:
>
> > Guys,
> >
> > Why do we need both a config option and a system property? I believe one way is
> > enough.
> > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[hidden email]> wrote:
> > >
> > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > >
> > > Fixed version is 2.7.
> > >
> > > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
> > > > > Nikolay, I agree, a user should be able to disable both the thread liveness
> > > > > check and the checkpoint read lock timeout check from config and a system
> > > > > property.
> > > > >
> > > > > Fri, Sep 28, 2018 at 11:30, Nikolay Izhikov <[hidden email]>:
> > > >
> > > > > Hello, Igniters.
> > > > >
> > > > > I found that this feature can't be disabled from config.
> > > > > The only way to disable it is from JMX bean.
> > > > >
> > > > > > I think it is very dangerous: if we have some corner case or a bug in this
> > > > > > watchdog, it can make Ignite unusable.
> > > > > > I propose to implement the possibility to disable this feature both from
> > > > > > config and from JVM options.
> > > > >
> > > > > What do you think?
> > > > >

Mmuzaf

Re: Critical worker threads liveness checking drawbacks

Andrey, Andrey,

> Thanks for being attentive! It's definitely a typo. Could you please create
> an issue?

I've created an issue [1] and prepared a PR [2].
Please review this change.

[1] https://issues.apache.org/jira/browse/IGNITE-9723
[2] https://github.com/apache/ignite/pull/4862
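
For context, the typo in question is the first of the two `blockingSectionEnd()` calls around `exchFut.get(...)`, which Maxim suggested should be `blockingSectionBegin()`. A minimal sketch of the intended begin/end pairing follows. The class below is a hypothetical stand-in, not Ignite's worker class; only the two method names are taken from the thread, and the flag is an assumed way of tracking the section.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a monitored worker; not Ignite's implementation. */
final class BlockingSectionSketch {
    /** True while the worker is inside an intentionally blocking call. */
    private volatile boolean inBlockingSection;

    void blockingSectionBegin() {
        inBlockingSection = true;
    }

    void blockingSectionEnd() {
        inBlockingSection = false;
    }

    boolean inBlockingSection() {
        return inBlockingSection;
    }

    /**
     * Runs a potentially long wait with the section opened before the call
     * and closed in 'finally', so that begin and end are always paired.
     */
    <T> T callBlocking(Supplier<T> wait) {
        blockingSectionBegin();

        try {
            return wait.get();
        }
        finally {
            blockingSectionEnd();
        }
    }
}
```

In this sketch a watchdog would be expected to skip a worker while `inBlockingSection()` returns true. Calling `blockingSectionEnd()` in both places would leave the flag false during the wait, so the long wait would look like a hang.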

On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <[hidden email]> wrote:

> Config option + mbean access. Does that make sense?
>
> Yakov
>
> On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <[hidden email]> wrote:
>
> > Then it should be config option.
> >
> > пт, 28 сент. 2018 г. в 13:15, Andrey Gura <[hidden email]>:
> >
> > > Guys,
> > >
> > > why we need both config option and system property? I believe one way
> is
> > > enough.
> > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[hidden email]>
> > > wrote:
> > > >
> > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > > >
> > > > Fixed version is 2.7.
> > > >
> > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> > > > > Nikolay, I agree, a user should be able to disable both thread
> > liveness
> > > > > check and checkpoint read lock timeout check from config and a
> system
> > > > > property.
> > > > >
> > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov <[hidden email]
> >:
> > > > >
> > > > > > Hello, Igniters.
> > > > > >
> > > > > > I found that this feature can't be disabled from config.
> > > > > > The only way to disable it is from JMX bean.
> > > > > >
> > > > > > I think it very dangerous: If we have some corner case or a bug
> in
> > > this
> > > > > > Watch Dog it can make Ignite unusable.
> > > > > > I propose to implement possibility to disable this feature both -
> > > from
> > > > > > config and from JVM options.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > > > > > > Maxim,
> > > > > > >
> > > > > > > Thanks for being attentive! It's definitely a typo. Could you
> > > please
> > > > > >
> > > > > > create
> > > > > > > an issue?
> > > > > > >
> > > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov <
> > [hidden email]
> > > >:
> > > > > > >
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1]
> > > (master
> > > > > >
> > > > > > branch)
> > > > > > > > exchange future wrapped
> > > > > > > > with double `blockingSectionEnd` method. Is it correct? I
> just
> > > want to
> > > > > > > > understand this change and
> > > > > > > > how should I use this in the future.
> > > > > > > >
> > > > > > > > Should I file a new issue to fix this? I think here
> > > > > >
> > > > > > `blockingSectionBegin`
> > > > > > > > method should be used.
> > > > > > > >
> > > > > > > > -------------
> > > > > > > > blockingSectionEnd();
> > > > > > > >
> > > > > > > > try {
> > > > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > > > > } finally {
> > > > > > > > blockingSectionEnd();
> > > > > > > > }
> > > > > > > >
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > >
> >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > > > >
> > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <
> > > [hidden email]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Andrey Gura, thank you for the answer!
> > > > > > > > >
> > > > > > > > > I agree that wrapping of 'init' method reduces the profit
> of
> > > watchdog
> > > > > > > > > service in case of PME worker, but in other cases, we
> should
> > > wrap all
> > > > > > > > > possible long sections on GridDhtPartitionExchangeFuture.
> For
> > > example
> > > > > > > > > 'onCacheChangeRequest' method or
> > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside because it
> may
> > > take
> > > > > > > > > significant time (reproducer attached).
> > > > > > > > >
> > > > > > > > > I only want to point out a possible issue which may allow
> to
> > > end-user
> > > > > > > > > halt the Ignite cluster accidentally.
> > > > > > > > >
> > > > > > > > > I'm sure that PME experts know how to fix this issue
> > properly.
> > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <
> > [hidden email]
> > > >
> > > > > >
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Vyacheslav,
> > > > > > > > > >
> > > > > > > > > > Exchange worker is strongly tied with
> > > > > > > > > > GridDhtPartitionExchangeFuture#init and it is ok.
> Exchange
> > > worker
> > > > > >
> > > > > > also
> > > > > > > > > > shouldn't be blocked for long time but in reality it
> > > happens.It
> > > > > >
> > > > > > also
> > > > > > > > > > means that your change doesn't make sense.
> > > > > > > > > >
> > > > > > > > > > What actually make sense it is identification of places
> > which
> > > > > > > > > > intentionally blocking. May be some places/actions should
> > be
> > > > > >
> > > > > > braced by
> > > > > > > > > > blocking guards.
> > > > > > > > > >
> > > > > > > > > > If you have failing tests please make sure that your
> > > > > >
> > > > > > failureHandler is
> > > > > > > > > > NoOpFailureHandler or any other handler with
> > > ignoreFailureTypes =
> > > > > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > > > > > > >
> > > > > > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Igniters!
> > > > > > > > > > >
> > > > > > > > > > > Thank you for this important improvement!
> > > > > > > > > > >
> > > > > > > > > > > I've looked through implementation and noticed that
> > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been
> wrapped
> > > in
> > > > > >
> > > > > > blocked
> > > > > > > > > > > section. This means it easy to halt the node in case of
> > > > > >
> > > > > > longrunning
> > > > > > > > > > > actions during PME, for example when we create a cache
> > with
> > > > > > > > > > > StoreFactrory which connect to 3rd party DB.
> > > > > > > > > > >
> > > > > > > > > > > I'm not sure that it is the right behavior.
> > > > > > > > > > >
> > > > > > > > > > > I filled the issue [1] and prepared the PR [2] with
> > > reproducer
> > > > > >
> > > > > > and
> > > > > > > > >
> > > > > > > > > possible fix.
> > > > > > > > > > >
> > > > > > > > > > > Andrey, could you please look at and confirm that it
> > makes
> > > sense?
> > > > > > > > > > >
> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <
> > > > > >
> > > > > > [hidden email]>
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Denis,
> > > > > > > > > > > >
> > > > > > > > > > > > I've created the ticket [1] with short description of
> > the
> > > > > > > > >
> > > > > > > > > functionality.
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]>:
> > > > > > > > > > > >
> > > > > > > > > > > > > Andrey K. and G.,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks, do we have a documentation ticket created?
> > > > > > > > > > > > > > Prachi (copied) can help with the documentation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Denis
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <
> > > > > >
> > > > > > [hidden email]>
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Andrey,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > finally your change is merged to the master branch.
> > > > > > > > > > > > > > > Congratulations and thank you very much! :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think that the next step is a feature that will
> > > > > > > > > > > > > > > allow signaling about blocked threads to the
> > > > > > > > > > > > > > > monitoring tools via MXBean.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I hope you will continue development of this
> > > > > > > > > > > > > > > feature and provide your vision in a new JIRA issue.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov
> <
> > > > > > > > >
> > > > > > > > > [hidden email]>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > David, Maxim!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't
> > > > > > > > > > > > > > > > adopt all of them right now: the scope is much
> > > > > > > > > > > > > > > > broader than the scope of the change I implement. I
> > > > > > > > > > > > > > > > have had a talk with a group of Ignite committers,
> > > > > > > > > > > > > > > > and we agreed to complete the change as follows.
> > > > > > > > > > > > > > > > - Blocking instructions in system-critical threads
> > > > > > > > > > > > > > > >   which may reasonably last long should be
> > > > > > > > > > > > > > > >   explicitly excluded from the monitoring.
> > > > > > > > > > > > > > > > - Failure handlers should have a setting to
> > > > > > > > > > > > > > > >   suppress some failures on a per-failure-type
> > > > > > > > > > > > > > > >   basis.
> > > > > > > > > > > > > > > > According to this I have updated the
> > > > > > > > > > > > > > > > implementation: [1]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Mon, 10 Sep 2018 at 22:35, David Harvey <[hidden email]>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When I've done this before, I've needed to find
> > > > > > > > > > > > > > > > > the oldest thread, and kill the node running
> > > > > > > > > > > > > > > > > that. From a language standpoint, Maxim's
> > > > > > > > > > > > > > > > > "without progress" is better than "heartbeat".
> > > > > > > > > > > > > > > > > For example, what I'm most interested in on a
> > > > > > > > > > > > > > > > > distributed system is which thread started the
> > > > > > > > > > > > > > > > > work it has not completed the earliest, and when
> > > > > > > > > > > > > > > > > that thread last made forward progress. You
> > > > > > > > > > > > > > > > > don't want to kill a node because a thread is
> > > > > > > > > > > > > > > > > waiting on a lock held by a thread that went
> > > > > > > > > > > > > > > > > off-node and has not gotten a response.
> > > > > > > > > > > > > > > > > If you don't understand the dependency
> > > > > > > > > > > > > > > > > relationships, you will make incorrect recovery
> > > > > > > > > > > > > > > > > decisions.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim
> > Muzafarov <
> > > > > > > > >
> > > > > > > > > [hidden email]>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I think we should find exact answers to these
> > > > > > > > > > > > > > > > > > questions:
> > > > > > > > > > > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > First,
> > > > > > > > > > > > > > > > > > - Ignore uninterruptible actions (e.g.
> > > > > > > > > > > > > > > > > >   worker\service shutdown)
> > > > > > > > > > > > > > > > > > - Long I/O operations (there should be a
> > > > > > > > > > > > > > > > > >   configurable timeout for each type of usage)
> > > > > > > > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many
> > > > > > > > > > > > > > > > > >   parked threads, excluding I/O)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Second,
> > > > > > > > > > > > > > > > > > - The working queue is without progress (e.g.
> > > > > > > > > > > > > > > > > >   disco, exchange queues)
> > > > > > > > > > > > > > > > > > - Work hasn't been completed since the last
> > > > > > > > > > > > > > > > > >   heartbeat (checking milestones)
> > > > > > > > > > > > > > > > > > - Too many system resources used by a thread
> > > > > > > > > > > > > > > > > >   for a long period of time (allocated memory,
> > > > > > > > > > > > > > > > > >   CPU)
> > > > > > > > > > > > > > > > > > - Timing fields associated with each thread
> > > > > > > > > > > > > > > > > >   status exceeded a maximum time limit.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > > > > > > > - `log everything` should be the default
> > > > > > > > > > > > > > > > > >   behaviour in all these cases, since it may be
> > > > > > > > > > > > > > > > > >   difficult to find the cause after the restart.
> > > > > > > > > > > > > > > > > > - Wait some interval of time and kill the
> > > > > > > > > > > > > > > > > >   hanging node (the cluster should be configured
> > > > > > > > > > > > > > > > > >   to be stable enough)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Questions,
> > > > > > > > > > > > > > > > > > - Not sure, but can workers miss their
> > > > > > > > > > > > > > > > > >   heartbeat deadlines if CPU load goes up to
> > > > > > > > > > > > > > > > > >   80%-90%? Bursts of momentary overload can be
> > > > > > > > > > > > > > > > > >   expected behaviour as a normal part of system
> > > > > > > > > > > > > > > > > >   operations.
> > > > > > > > > > > > > > > > > > - Why did we decide that critical threads
> > > > > > > > > > > > > > > > > >   should monitor each other? For instance, if
> > > > > > > > > > > > > > > > > >   all the tasks were blocked and unable to run,
> > > > > > > > > > > > > > > > > >   node reset would never occur. As for me, a
> > > > > > > > > > > > > > > > > >   better solution is to use a separate monitor
> > > > > > > > > > > > > > > > > >   thread or pool (maybe both with software and
> > > > > > > > > > > > > > > > > >   hardware checks) that not only checks
> > > > > > > > > > > > > > > > > >   heartbeats but monitors the rest of the system
> > > > > > > > > > > > > > > > > >   as well.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <
> > > > > > > > >
> > > > > > > > > [hidden email]>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > It would be safer to restart the entire
> > > > > > > > > > > > > > > > > > > cluster than to remove the last node for a
> > > > > > > > > > > > > > > > > > > cache that should be redundant.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura
> <
> > > > > > > > >
> > > > > > > > > [hidden email]>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can provide some
> > > > > > > > > > > > > > > > > > > > option that manages the worker liveness
> > > > > > > > > > > > > > > > > > > > checker behavior when some worker is
> > > > > > > > > > > > > > > > > > > > observed to be blocked for too long.
> > > > > > > > > > > > > > > > > > > > At least it will be some workaround for
> > > > > > > > > > > > > > > > > > > > cases when node failure is too annoying.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Backups count threshold sounds good but I
> > > > > > > > > > > > > > > > > > > > don't understand how it will help in case
> > > > > > > > > > > > > > > > > > > > of cluster hanging.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > The simplest solution here is to alert when
> > > > > > > > > > > > > > > > > > > > some critical worker is blocked (we can
> > > > > > > > > > > > > > > > > > > > improve WorkersRegistry for this purpose
> > > > > > > > > > > > > > > > > > > > and expose the list of blocked workers) and
> > > > > > > > > > > > > > > > > > > > optionally call the system-configured
> > > > > > > > > > > > > > > > > > > > failure processor. BTW, the failure
> > > > > > > > > > > > > > > > > > > > processor can be extended in order to
> > > > > > > > > > > > > > > > > > > > perform any checks (e.g. backup count) and
> > > > > > > > > > > > > > > > > > > > decide whether it should stop the node or
> > > > > > > > > > > > > > > > > > > > not.
> > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey
> > > Kuznetsov <
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [hidden email]>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But
> > > > > > > > > > > > > > > > > > > > > liveness checks deal with _critical_
> > > > > > > > > > > > > > > > > > > > > conditions, i.e. when such a condition is
> > > > > > > > > > > > > > > > > > > > > met we consider the node totally broken,
> > > > > > > > > > > > > > > > > > > > > and there is no sense in keeping it alive
> > > > > > > > > > > > > > > > > > > > > regardless of the data it contains. If we
> > > > > > > > > > > > > > > > > > > > > want to give it a chance, then the
> > > > > > > > > > > > > > > > > > > > > condition (long fsync etc.) should not be
> > > > > > > > > > > > > > > > > > > > > considered critical at all.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <[hidden email]>:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to have an
> > > > > > > > > > > > > > > > > > > > > > opportunity to set a backups count
> > > > > > > > > > > > > > > > > > > > > > threshold (at runtime also!) that will
> > > > > > > > > > > > > > > > > > > > > > not allow any automatic stop if there
> > > > > > > > > > > > > > > > > > > > > > will be a data loss. Andrey, what do you
> > > > > > > > > > > > > > > > > > > > > > think?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > --
> > > > > > > > Maxim Muzafarov
> > > > > > > >
> > > > > > >
> > > > > > >
> > >
> >
>

--
--
Maxim Muzafarov

Andrey Kuznetsov

Re: Critical worker threads liveness checking drawbacks

Igniters,

I have now spotted blocking / long-running code arising from
{{GridDhtPartitionsExchangeFuture#init}} calls in the partition-exchanger
thread, see [1]. Ideally, all blocking operations along all possible code
paths should be guarded implicitly from the critical failure detector, so
that the thread is not considered blocked. There is a pull request [2] that
provides a shallow solution. I didn't change code outside
{{GridDhtPartitionsExchangeFuture}}, otherwise the fix could be broken by
any upcoming change. Also, I didn't touch the code that can be run by
threads other than partition-exchanger. As a result, I have a number of
guarded sections that are wider than they could be, and this potentially
hides issues from the failure detector. Does this PR make sense? Or maybe
it's better to exclude partition-exchanger from the critical threads
registry altogether?

[1] https://issues.apache.org/jira/browse/IGNITE-9710
[2] https://github.com/apache/ignite/pull/4962
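
For illustration only, the guarded-section idea discussed in this thread can be sketched as a small self-contained Java class. This is not the Ignite implementation: the class name GuardedWorkerSketch, the Long.MAX_VALUE sentinel, the isBlocked check and the guarded helper are assumptions made for the example. Only the method names blockingSectionBegin / blockingSectionEnd are taken from the thread. The sketch also shows why the begin call must come before the try block and the end call in finally.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/**
 * Hypothetical stand-in for a monitored worker thread.
 * NOT the Ignite worker class; all internals here are assumptions.
 */
final class GuardedWorkerSketch {
    /** Sentinel meaning "inside a guarded blocking section" (assumption of this sketch). */
    private static final long IN_BLOCKING_SECTION = Long.MAX_VALUE;

    /** Timestamp (ms) of the last sign of progress reported by the worker. */
    private final AtomicLong heartbeatTs = new AtomicLong(System.currentTimeMillis());

    /** Reports progress: the worker is alive right now. */
    void updateHeartbeat() {
        heartbeatTs.set(System.currentTimeMillis());
    }

    /** Marks the start of an operation that may legitimately block for long. */
    void blockingSectionBegin() {
        heartbeatTs.set(IN_BLOCKING_SECTION);
    }

    /** Marks the end of the blocking operation and resumes monitoring. */
    void blockingSectionEnd() {
        updateHeartbeat();
    }

    /** What a watchdog would check: is the last heartbeat older than the timeout? */
    boolean isBlocked(long nowMs, long timeoutMs) {
        long ts = heartbeatTs.get();

        return ts != IN_BLOCKING_SECTION && nowMs - ts > timeoutMs;
    }

    /** Runs a potentially long operation with begin/end correctly paired. */
    <T> T guarded(Supplier<T> op) {
        blockingSectionBegin();

        try {
            return op.get();
        }
        finally {
            blockingSectionEnd();
        }
    }

    public static void main(String[] args) {
        GuardedWorkerSketch worker = new GuardedWorkerSketch();

        // Pretend the watchdog looks one minute ahead with a one-second timeout.
        long farFuture = System.currentTimeMillis() + 60_000;

        boolean flaggedInside = worker.guarded(() -> worker.isBlocked(farFuture, 1_000));

        System.out.println("flagged inside guarded section: " + flaggedInside);
        System.out.println("flagged with stale heartbeat: " + worker.isBlocked(farFuture, 1_000));
    }
}
```

In this sketch a guarded section is invisible to the watchdog for its whole duration, which is exactly why wide guarded sections can hide real problems: the narrower the section, the more of the worker's code remains monitored.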

Fri, 28 Sep 2018 at 18:56, Maxim Muzafarov <[hidden email]>:

> Andrey, Andrey
>
> > Thanks for being attentive! It's definitely a typo. Could you please
> create
> > an issue?
>
> I've created an issue [1] and prepared PR [2].
> Please, review this change.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9723
> [2] https://github.com/apache/ignite/pull/4862
>
> On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <[hidden email]> wrote:
>
> > Config option + mbean access. Does that make sense?
> >
> > Yakov
> >
> > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <[hidden email]>
> wrote:
> >
> > > Then it should be config option.
> > >
> > > > Fri, 28 Sep 2018 at 13:15, Andrey Gura <[hidden email]>:
> > >
> > > > Guys,
> > > >
> > > > > why do we need both a config option and a system property? I
> > > > > believe one way is enough.
> > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <
> [hidden email]>
> > > > wrote:
> > > > >
> > > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > > > >
> > > > > Fixed version is 2.7.
> > > > >
> > > > > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
> > > > > > > Nikolay, I agree, a user should be able to disable both the
> > > > > > > thread liveness check and the checkpoint read lock timeout
> > > > > > > check from config and a system property.
> > > > > >
> > > > > > > Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[hidden email]>:
> > > > > >
> > > > > > > Hello, Igniters.
> > > > > > >
> > > > > > > I found that this feature can't be disabled from config.
> > > > > > > The only way to disable it is from JMX bean.
> > > > > > >
> > > > > > > > I think it is very dangerous: if we have some corner case or
> > > > > > > > a bug in this watchdog, it can make Ignite unusable.
> > > > > > > > I propose to implement the ability to disable this feature
> > > > > > > > both from config and from JVM options.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > > > > > > > Maxim,
> > > > > > > >
> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could
> > > > > > > > > you please create an issue?
> > > > > > > >
> > > > > > > > > Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[hidden email]>:
> > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > > I've found that in `GridCachePartitionExchangeManager:2684`
> > > > > > > > > > [1] (master branch) the exchange future is wrapped with a
> > > > > > > > > > double `blockingSectionEnd` call. Is it correct? I just
> > > > > > > > > > want to understand this change and how I should use it in
> > > > > > > > > > the future.
> > > > > > > > > >
> > > > > > > > > > Should I file a new issue to fix this? I think the
> > > > > > > > > > `blockingSectionBegin` method should be used here.
> > > > > > > > >
> > > > > > > > > -------------
> > > > > > > > > blockingSectionEnd();
> > > > > > > > >
> > > > > > > > > try {
> > > > > > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > > > > > } finally {
> > > > > > > > > blockingSectionEnd();
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > > > > >
> > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <
> > > > [hidden email]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Andrey Gura, thank you for the answer!
> > > > > > > > > >
> > > > > > > > > > > I agree that wrapping the 'init' method reduces the
> > > > > > > > > > > benefit of the watchdog service in the case of the PME
> > > > > > > > > > > worker, but in other cases we should wrap all possibly
> > > > > > > > > > > long sections in GridDhtPartitionsExchangeFuture. For
> > > > > > > > > > > example, the 'onCacheChangeRequest' method or
> > > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside it,
> > > > > > > > > > > because it may take significant time (reproducer
> > > > > > > > > > > attached).
> > > > > > > > > > >
> > > > > > > > > > > I only want to point out a possible issue which may
> > > > > > > > > > > allow an end user to halt the Ignite cluster
> > > > > > > > > > > accidentally.
> > > > > > > > > > >
> > > > > > > > > > > I'm sure that PME experts know how to fix this issue
> > > > > > > > > > > properly.
> > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <
> > > [hidden email]
> > > > >
> > > > > > >
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Vyacheslav,
> > > > > > > > > > >
> > > > > > > > > > > > The exchange worker is strongly tied to
> > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init and that is OK. The
> > > > > > > > > > > > exchange worker also shouldn't be blocked for a long
> > > > > > > > > > > > time, but in reality it happens. It also means that
> > > > > > > > > > > > your change doesn't make sense.
> > > > > > > > > > > >
> > > > > > > > > > > > What actually makes sense is identifying the places
> > > > > > > > > > > > which block intentionally. Maybe some places/actions
> > > > > > > > > > > > should be braced by blocking guards.
> > > > > > > > > > > >
> > > > > > > > > > > > If you have failing tests, please make sure that your
> > > > > > > > > > > > failureHandler is NoOpFailureHandler or any other
> > > > > > > > > > > > handler with ignoreFailureTypes =
> > > > > > > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > > > > > > > >
> > > > > > > > > [hidden email]>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Igniters!
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for this important improvement!
> > > > > > > > > > > >
> > > > > > > > > > > > > I've looked through the implementation and noticed that
> > > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been
> > > > > > > > > > > > > wrapped in a blocking section. This means it is easy to
> > > > > > > > > > > > > halt the node in case of long-running actions during
> > > > > > > > > > > > > PME, for example when we create a cache with a
> > > > > > > > > > > > > StoreFactory which connects to a 3rd-party DB.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm not sure that it is the right behavior.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I filed the issue [1] and prepared the PR [2] with a
> > > > > > > > > > > > > reproducer and a possible fix.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Andrey, could you please take a look and confirm that
> > > > > > > > > > > > > it makes sense?
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <
> > > > > > >
> > > > > > > [hidden email]>
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Denis,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've created the ticket [1] with short description
> of
> > > the
> > > > > > > > > >
> > > > > > > > > > functionality.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Andrey K. and G.,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks, do we have a documentation ticket
> created?
> > > > Prachi
> > > > > > > > >
> > > > > > > > > (copied)
> > > > > > > > > > can help
> > > > > > > > > > > > > > with the documentation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <
> > > > > > >
> > > > > > > [hidden email]>
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Andrey,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > finally your change is merged to master branch.
> > > > > > >
> > > > > > > Congratulations
> > > > > > > > > >
> > > > > > > > > > and
> > > > > > > > > > > > > > > thank you very much! :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think that the next step is feature that will
> > > allow
> > > > > > >
> > > > > > > signal
> > > > > > > > > >
> > > > > > > > > > about
> > > > > > > > > > > > > > > blocked threads to the monitoring tools via
> > MXBean.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I hope you will continue development of this
> > > feature
> > > > and
> > > > > > > > >
> > > > > > > > > provide
> > > > > > > > > > your
> > > > > > > > > > > > > > > vision in new JIRA issue.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey
> Kuznetsov
> > <
> > > > > > > > > >
> > > > > > > > > > [hidden email]>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > David, Maxim!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks a lot for you ideas. Unfortunately, I
> > > can't
> > > > adopt
> > > > > > >
> > > > > > > all
> > > > > > > > > >
> > > > > > > > > > of them
> > > > > > > > > > > > > > > right
> > > > > > > > > > > > > > > > now: the scope is much broader than the scope
> > of
> > > > the
> > > > > > >
> > > > > > > change I
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > implement.
> > > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > > have had a talk to a group of Ignite
> commiters,
> > > > and we
> > > > > > >
> > > > > > > agreed
> > > > > > > > > >
> > > > > > > > > > to
> > > > > > > > > > > > > > complete
> > > > > > > > > > > > > > > > the change as follows.
> > > > > > > > > > > > > > > > - Blocking instructions in system-critical
> > which
> > > > may
> > > > > > > > >
> > > > > > > > > resonably
> > > > > > > > > > last
> > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > should be explicitly excluded from the
> > > monitoring.
> > > > > > > > > > > > > > > > - Failure handlers should have a setting to
> > > > suppress some
> > > > > > > > > >
> > > > > > > > > > failures on
> > > > > > > > > > > > > > > > per-failure-type basis.
> > > > > > > > > > > > > > > > According to this I have updated the
> > > > implementation: [1]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1]
> https://github.com/apache/ignite/pull/4089
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Mon, 10 Sep 2018 at 22:35, David Harvey <[hidden email]>:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When I've done this before,I've needed to
> > find
> > > > the
> > > > > > >
> > > > > > > oldest
> > > > > > > > > >
> > > > > > > > > > thread,
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > kill
> > > > > > > > > > > > > > > > > the node running that. From a language
> > > > standpoint,
> > > > > > > > >
> > > > > > > > > Maxim's
> > > > > > > > > > "without
> > > > > > > > > > > > > > > > > progress" better than "heartbeat". For
> > > > example, what
> > > > > > >
> > > > > > > I'm
> > > > > > > > > >
> > > > > > > > > > most
> > > > > > > > > > > > > > > interested
> > > > > > > > > > > > > > > > > in on a distributed system is which thread
> > > > started the
> > > > > > >
> > > > > > > work
> > > > > > > > > >
> > > > > > > > > > it has
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > completed the earliest, and when did that
> > > thread
> > > > last
> > > > > > >
> > > > > > > make
> > > > > > > > > >
> > > > > > > > > > forward
> > > > > > > > > > > > > > > > > process. You don't want to kill a node
> > > > because a
> > > > > > >
> > > > > > > thread
> > > > > > > > > >
> > > > > > > > > > is
> > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > on a
> > > > > > > > > > > > > > > > > lock held by a thread that went off-node
> and
> > > has
> > > > not
> > > > > > > > >
> > > > > > > > > gotten a
> > > > > > > > > > > > > > response.
> > > > > > > > > > > > > > > > > If you don't understand the dependency
> > > > relationships,
> > > > > > >
> > > > > > > you
> > > > > > > > > >
> > > > > > > > > > will make
> > > > > > > > > > > > > > > > > incorrect recovery decisions.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <[hidden email]>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > > > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > First,
> > > > > > > > > > > > > > > > > > - Ignore uninterruptible actions (e.g. worker\service shutdown)
> > > > > > > > > > > > > > > > > > - Long I/O operations (should be a configurable timeout for each
> > > > > > > > > > > > > > > > > > type of usage)
> > > > > > > > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked threads,
> > > > > > > > > > > > > > > > > > exclude I/O)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Second,
> > > > > > > > > > > > > > > > > > - The working queue is without progress (e.g. disco, exchange
> > > > > > > > > > > > > > > > > > queues)
> > > > > > > > > > > > > > > > > > - Work hasn't been completed since the last heartbeat (checking
> > > > > > > > > > > > > > > > > > milestones)
> > > > > > > > > > > > > > > > > > - Too many system resources used by a thread for a long period of
> > > > > > > > > > > > > > > > > > time (allocated memory, CPU)
> > > > > > > > > > > > > > > > > > - Timing fields associated with each thread status exceeded a
> > > > > > > > > > > > > > > > > > maximum time limit.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > > > > > > > - `log everything` should be the default behaviour in all these
> > > > > > > > > > > > > > > > > > cases, since it may be difficult to find the cause after the
> > > > > > > > > > > > > > > > > > restart.
> > > > > > > > > > > > > > > > > > - Wait some interval of time and kill the hanging node (cluster
> > > > > > > > > > > > > > > > > > should be configured stable enough)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Questions,
> > > > > > > > > > > > > > > > > > - Not sure, but can workers miss their heartbeat deadlines if CPU
> > > > > > > > > > > > > > > > > > loads up to 80%-90%? Bursts of momentary overloads can be
> > > > > > > > > > > > > > > > > > expected behaviour as a normal part of system operations.
> > > > > > > > > > > > > > > > > > - Why do we decide that critical threads should monitor each
> > > > > > > > > > > > > > > > > > other? For instance, if all the tasks were blocked and unable to
> > > > > > > > > > > > > > > > > > run, node reset would never occur. As for me, a better solution
> > > > > > > > > > > > > > > > > > is to use a separate monitor thread or pool (maybe both with
> > > > > > > > > > > > > > > > > > software and hardware checks) that not only checks heartbeats
> > > > > > > > > > > > > > > > > > but monitors the rest of the system as well.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <[hidden email]> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > It would be safer to restart the entire cluster than to remove
> > > > > > > > > > > > > > > > > > > the last node for a cache that should be redundant.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can provide some option that manages
> > > > > > > > > > > > > > > > > > > > the worker liveness checker behavior when some worker is
> > > > > > > > > > > > > > > > > > > > observed to be blocked too long.
> > > > > > > > > > > > > > > > > > > > At least it will be some workaround for cases when node failure
> > > > > > > > > > > > > > > > > > > > is too annoying.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Backups count threshold sounds good but I don't understand how
> > > > > > > > > > > > > > > > > > > > it will help in case of cluster hanging.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > The simplest solution here is an alert in cases of blocking of
> > > > > > > > > > > > > > > > > > > > some critical worker (we can improve WorkersRegistry for this
> > > > > > > > > > > > > > > > > > > > purpose and expose the list of blocked workers) and optionally
> > > > > > > > > > > > > > > > > > > > a call to the system-configured failure processor. BTW, the
> > > > > > > > > > > > > > > > > > > > failure processor can be extended in order to perform any
> > > > > > > > > > > > > > > > > > > > checks (e.g. backup count) and decide whether it should stop
> > > > > > > > > > > > > > > > > > > > the node or not.
> > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <[hidden email]>
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness checks deal
> > > > > > > > > > > > > > > > > > > > > with _critical_ conditions, i.e. when such a condition is met
> > > > > > > > > > > > > > > > > > > > > we consider the node totally broken, and there is no sense in
> > > > > > > > > > > > > > > > > > > > > keeping it alive regardless of the data it contains. If we
> > > > > > > > > > > > > > > > > > > > > want to give it a chance, then the condition (long fsync etc.)
> > > > > > > > > > > > > > > > > > > > > should not be considered critical at all.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <[hidden email]>:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a
> > > > > > > > > > > > > > > > > > > > > > backups count threshold (at runtime also!) that will not
> > > > > > > > > > > > > > > > > > > > > > allow any automatic stop if there will be a data loss.
> > > > > > > > > > > > > > > > > > > > > > Andrey, what do you think?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > --
> > > > > > > > > Maxim Muzafarov
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > >
> > >
> >
> --
> --
> Maxim Muzafarov
>

--
Best regards,
Andrey Kuznetsov.

Alexey Goncharuk

Re: Critical worker threads liveness checking drawbacks

Andrey,

I still see that a checkpoint read lock acquisition timeout raises a
CRITICAL_ERROR, which by default will shut down the local node. As far as I
remember, we decided that by default a thread timeout should not trigger node
failure. Now, however, it does, because we ignore only SYSTEM_WORKER_BLOCKED
events in the default configuration.

Should we introduce another critical failure type
CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for a checkpoint
read lock acquisition failure?

--AG
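
To make the question above concrete, here is a minimal, self-contained sketch of per-failure-type suppression. It is only a model: `SketchFailureType` and `SketchFailureHandler` are hypothetical names, not Ignite classes, and the "default" ignored set used in the usage note is an assumption taken from this message rather than from the code base.

```java
import java.util.EnumSet;
import java.util.Set;

/** Failure kinds named in this message; a simplified stand-in, not Ignite's own enum. */
enum SketchFailureType { SYSTEM_WORKER_BLOCKED, CRITICAL_ERROR }

/** Hypothetical handler: failures whose type is in the ignored set never stop the node. */
final class SketchFailureHandler {
    /** Failure types that are reported but do not trigger a node stop. */
    private final Set<SketchFailureType> ignored = EnumSet.noneOf(SketchFailureType.class);

    SketchFailureHandler(Set<SketchFailureType> ignoredTypes) {
        ignored.addAll(ignoredTypes);
    }

    /** Returns true when a failure of the given type should shut the local node down. */
    boolean shouldStopNode(SketchFailureType type) {
        return !ignored.contains(type);
    }
}
```

With a handler built as `new SketchFailureHandler(EnumSet.of(SketchFailureType.SYSTEM_WORKER_BLOCKED))`, a blocked-worker report is suppressed, while a lock timeout reported as `CRITICAL_ERROR` still stops the node. Reporting the timeout under an ignored type would change that outcome without touching the handler.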

Fri, 12 Oct 2018 at 8:29, Andrey Kuznetsov <[hidden email]>:

> Igniters,
>
> Now I spot blocking / long-running code arising from
> {{GridDhtPartitionsExchangeFuture#init}} calls in the partition-exchanger
> thread, see [1]. Ideally, all blocking operations along all possible code
> paths should be guarded implicitly from the critical failure detector to keep
> the thread from being considered blocked. There is a pull request [2] that
> provides a shallow solution. I didn't change code outside
> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
> upcoming change. Also, I didn't touch the code runnable by threads other
> than partition-exchanger. So I have a number of guarded sections that are
> wider than they could be, and this potentially hides issues from the failure
> detector. Does this PR make sense? Or maybe it's better to exclude
> partition-exchanger from the critical threads registry altogether?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> [2] https://github.com/apache/ignite/pull/4962
>
>
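
The trade-off with guarded sections in the quoted message can be illustrated with a small model. This is a sketch under stated assumptions, not Ignite's worker implementation: `SketchWorker`, its injected clock, and the `guarded` helper are hypothetical; only the method names `blockingSectionBegin` and `blockingSectionEnd` come from the thread.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical worker model; method names mirror the thread, the class is not Ignite's. */
final class SketchWorker {
    /** Injected time source in milliseconds, so the behaviour is deterministic. */
    private final LongSupplier clock;
    private volatile long heartbeatTs;
    private volatile boolean inBlockingSection;

    SketchWorker(LongSupplier clock) {
        this.clock = clock;
        this.heartbeatTs = clock.getAsLong();
    }

    /** Called by the worker whenever it makes progress. */
    void updateHeartbeat() {
        heartbeatTs = clock.getAsLong();
    }

    /** Tells the watchdog that a potentially long blocking call is about to start. */
    void blockingSectionBegin() {
        inBlockingSection = true;
    }

    /** Closes the guard and counts the return from the blocking call as progress. */
    void blockingSectionEnd() {
        inBlockingSection = false;
        heartbeatTs = clock.getAsLong();
    }

    /** Begin goes before the try and end goes in finally, so the guard is always closed. */
    <T> T guarded(Supplier<T> blockingOp) {
        blockingSectionBegin();
        try {
            return blockingOp.get();
        }
        finally {
            blockingSectionEnd();
        }
    }

    /** Watchdog check: blocked if there was no heartbeat for longer than thresholdMs outside a guard. */
    boolean isBlocked(long thresholdMs) {
        return !inBlockingSection && clock.getAsLong() - heartbeatTs > thresholdMs;
    }
}
```

While a guard is open, `isBlocked` always returns false in this model, which is why guards that are wider than necessary hide real stalls from the failure detector.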
> Fri, 28 Sep 2018 at 18:56, Maxim Muzafarov <[hidden email]>:
>
> > Andrey, Andrey
> >
> > > Thanks for being attentive! It's definitely a typo. Could you please
> > > create an issue?
> >
> > I've created an issue [1] and prepared PR [2].
> > Please, review this change.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
> > [2] https://github.com/apache/ignite/pull/4862
> >
> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <[hidden email]> wrote:
> >
> > > Config option + mbean access. Does that make sense?
> > >
> > > Yakov
> > >
> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <[hidden email]> wrote:
> > >
> > > > Then it should be config option.
> > > >
> > > > Fri, 28 Sep 2018 at 13:15, Andrey Gura <[hidden email]>:
> > > >
> > > > > Guys,
> > > > >
> > > > > why do we need both a config option and a system property? I believe one
> > > > > way is enough.
> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[hidden email]>
> > > > > wrote:
> > > > > >
> > > > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > > > > >
> > > > > > Fixed version is 2.7.
> > > > > >
> > > > > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
> > > > > > > Nikolay, I agree, a user should be able to disable both the thread
> > > > > > > liveness check and the checkpoint read lock timeout check from config
> > > > > > > and a system property.
> > > > > > >
> > > > > > > Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[hidden email]>:
> > > > > > >
> > > > > > > > Hello, Igniters.
> > > > > > > >
> > > > > > > > I found that this feature can't be disabled from config.
> > > > > > > > The only way to disable it is from JMX bean.
> > > > > > > >
> > > > > > > > I think it is very dangerous: if we have some corner case or a bug in
> > > > > > > > this watchdog, it can make Ignite unusable.
> > > > > > > > I propose implementing the ability to disable this feature both from
> > > > > > > > config and from JVM options.
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > > > > > > > > Maxim,
> > > > > > > > >
> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could you please
> > > > > > > > > create an issue?
> > > > > > > > >
> > > > > > > > > Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[hidden email]>:
> > > > > > > > >
> > > > > > > > > > Folks,
> > > > > > > > > >
> > > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master
> > > > > > > > > > branch) exchange future wrapped
> > > > > > > > > > with double `blockingSectionEnd` method. Is it correct? I just want to
> > > > > > > > > > understand this change and
> > > > > > > > > > how should I use this in the future.
> > > > > > > > > >
> > > > > > > > > > Should I file a new issue to fix this? I think here
> > > > > > > > > > `blockingSectionBegin` method should be used.
> > > > > > > > > >
> > > > > > > > > > -------------
> > > > > > > > > > blockingSectionEnd();
> > > > > > > > > >
> > > > > > > > > > try {
> > > > > > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > > > > > > } finally {
> > > > > > > > > > blockingSectionEnd();
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > > > > > >
> > > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Andrey Gura, thank you for the answer!
> > > > > > > > > > >
> > > > > > > > > > > I agree that wrapping of 'init' method reduces the profit of watchdog
> > > > > > > > > > > service in case of PME worker, but in other cases, we should wrap all
> > > > > > > > > > > possible long sections on GridDhtPartitionsExchangeFuture. For example
> > > > > > > > > > > 'onCacheChangeRequest' method or
> > > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside because it may take
> > > > > > > > > > > significant time (reproducer attached).
> > > > > > > > > > >
> > > > > > > > > > > I only want to point out a possible issue which may allow an end user
> > > > > > > > > > > to halt the Ignite cluster accidentally.
> > > > > > > > > > >
> > > > > > > > > > > I'm sure that PME experts know how to fix this issue properly.
> > > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Vyacheslav,
> > > > > > > > > > > >
> > > > > > > > > > > > Exchange worker is strongly tied with
> > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init and it is ok. Exchange worker also
> > > > > > > > > > > > shouldn't be blocked for long time but in reality it happens. It also
> > > > > > > > > > > > means that your change doesn't make sense.
> > > > > > > > > > > >
> > > > > > > > > > > > What actually makes sense is identification of places which are
> > > > > > > > > > > > intentionally blocking. Maybe some places/actions should be braced by
> > > > > > > > > > > > blocking guards.
> > > > > > > > > > > >
> > > > > > > > > > > > If you have failing tests please make sure that your failureHandler is
> > > > > > > > > > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > > > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <[hidden email]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Igniters!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you for this important improvement!
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've looked through the implementation and noticed that
> > > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> > > > > > > > > > > > > blocking section. This means it is easy to halt the node in case of
> > > > > > > > > > > > > long-running actions during PME, for example when we create a cache
> > > > > > > > > > > > > with StoreFactory which connects to a 3rd party DB.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm not sure that it is the right behavior.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I filed the issue [1] and prepared the PR [2] with a reproducer and
> > > > > > > > > > > > > a possible fix.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Andrey, could you please look at it and confirm that it makes sense?
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <[hidden email]>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Denis,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've created the ticket [1] with a short description of the
> > > > > > > > > > > > > > functionality.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]>:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Andrey K. and G.,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks, do we have a documentation ticket created? Prachi (copied)
> > > > > > > > > > > > > > > can help with the documentation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Andrey,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > finally your change is merged to the master branch.
> > > > > > > > > > > > > > > > Congratulations and thank you very much! :)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think that the next step is a feature that will allow signaling
> > > > > > > > > > > > > > > > about blocked threads to the monitoring tools via MXBean.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I hope you will continue development of this feature and provide
> > > > > > > > > > > > > > > > your vision in a new JIRA issue.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <[hidden email]>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > David, Maxim!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt all of
> > > > > > > > > > > > > > > > > them right now: the scope is much broader than the scope of the
> > > > > > > > > > > > > > > > > change I implement. I have had a talk with a group of Ignite
> > > > > > > > > > > > > > > > > committers, and we agreed to complete the change as follows.
> > > > > > > > > > > > > > > > > - Blocking instructions in system-critical threads which may
> > > > > > > > > > > > > > > > > reasonably last long should be explicitly excluded from the
> > > > > > > > > > > > > > > > > monitoring.
> > > > > > > > > > > > > > > > > - Failure handlers should have a setting to suppress some failures
> > > > > > > > > > > > > > > > > on a per-failure-type basis.
> > > > > > > > > > > > > > > > > According to this I have updated the implementation: [1]
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Mon, 10 Sep 2018 at 22:35, David Harvey <[hidden email]>:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > When I've done this before, I've needed to find the oldest
> > > > > > > > > > > > > > > > > > thread, and kill the node running that. From a language
> > > > > > > > > > > > > > > > > > standpoint, Maxim's "without progress" is better than
> > > > > > > > > > > > > > > > > > "heartbeat". For example, what I'm most interested in on a
> > > > > > > > > > > > > > > > > > distributed system is which thread started the work it has not
> > > > > > > > > > > > > > > > > > completed the earliest, and when did that thread last make
> > > > > > > > > > > > > > > > > > forward progress. You don't want to kill a node because a thread
> > > > > > > > > > > > > > > > > > is waiting on a lock held by a thread that went off-node and has
> > > > > > > > > > > > > > > > > > not gotten a response. If you don't understand the dependency
> > > > > > > > > > > > > > > > > > relationships, you will make incorrect recovery decisions.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <[hidden email]>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > > > > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > First,
> > > > > > > > > > > > > > > > > > > - Ignore uninterruptible actions (e.g. worker\service shutdown)
> > > > > > > > > > > > > > > > > > > - Long I/O operations (should be a configurable timeout for
> > > > > > > > > > > > > > > > > > > each type of usage)
> > > > > > > > > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked threads,
> > > > > > > > > > > > > > > > > > > exclude I/O)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Second,
> > > > > > > > > > > > > > > > > > > - The working queue is without progress (e.g. disco, exchange
> > > > > > > > > > > > > > > > > > > queues)
> > > > > > > > > > > > > > > > > > > - Work hasn't been completed since the last heartbeat (checking
> > > > > > > > > > > > > > > > > > > milestones)
> > > > > > > > > > > > > > > > > > > - Too many system resources used by a thread for a long period
> > > > > > > > > > > > > > > > > > > of time (allocated memory, CPU)
> > > > > > > > > > > > > > > > > > > - Timing fields associated with each thread status exceeded a
> > > > > > > > > > > > > > > > > > > maximum time limit.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > > > > > > > > - `log everything` should be the default behaviour in all these
> > > > > > > > > > > > > > > > > > > cases, since it may be difficult to find the cause after the
> > > > > > > > > > > > > > > > > > > restart.
> > > > > > > > > > > > > > > > > > > - Wait some interval of time and kill the hanging node (cluster
> > > > > > > > > > > > > > > > > > > should be configured stable enough)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Questions,
> > > > > > > > > > > > > > > > > > > - Not sure, but can workers miss their heartbeat deadlines if
> > > > > > > > > > > > > > > > > > > CPU loads up to 80%-90%? Bursts of momentary overloads can be
> > > > > > > > > > > > > > > > > > > expected behaviour as a normal part of system operations.
> > > > > > > > > > > > > > > > > > > - Why do we decide that critical threads should monitor each
> > > > > > > > > > > > > > > > > > > other? For instance, if all the tasks were blocked and unable
> > > > > > > > > > > > > > > > > > > to run, node reset would never occur. As for me, a better
> > > > > > > > > > > > > > > > > > > solution is to use a separate monitor thread or pool (maybe
> > > > > > > > > > > > > > > > > > > both with software and hardware checks) that not only checks
> > > > > > > > > > > > > > > > > > > heartbeats but monitors the rest of the system as well.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <[hidden email]>
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > It would be safer to restart the entire cluster than to remove
> > > > > > > > > > > > > > > > > > > > the last node for a cache that should be redundant.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <[hidden email]> wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can provide some option that manages
> > > > > > > > > > > > > > > > > > > > > the worker liveness checker behavior when some worker is
> > > > > > > > > > > > > > > > > > > > > observed to be blocked too long.
> > > > > > > > > > > > > > > > > > > > > At least it will be some workaround for cases when node failure
> > > > > > > > > > > > > > > > > > > > > is too annoying.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Backups count threshold sounds good but I don't understand how
> > > > > > > > > > > > > > > > > > > > > it will help in case of cluster hanging.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > The simplest solution here is an alert in cases of blocking of
> > > > > > > > > > > > > > > > > > > > > some critical worker (we can improve WorkersRegistry for this
> > > > > > > > > > > > > > > > > > > > > purpose and expose the list of blocked workers) and optionally
> > > > > > > > > > > > > > > > > > > > > a call to the system-configured failure processor. BTW, the
> > > > > > > > > > > > > > > > > > > > > failure processor can be extended in order to perform any
> > > > > > > > > > > > > > > > > > > > > checks (e.g. backup count) and decide whether it should stop
> > > > > > > > > > > > > > > > > > > > > the node or not.
> > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <[hidden email]>
> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your
> > > fears.
> > > > > But
> > > > > > > >
> > > > > > > > liveness
> > > > > > > > > > >
> > > > > > > > > > > checks
> > > > > > > > > > > > > > > deal
> > > > > > > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > > > > > _critical_ conditions, i.e. when
> > > such a
> > > > > > > >
> > > > > > > > condition
> > > > > > > > > >
> > > > > > > > > > is
> > > > > > > > > > > met we
> > > > > > > > > > > > > > > > > > conclude
> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > node as totally broken, and there
> > is
> > > no
> > > > > sense
> > > > > > > >
> > > > > > > > to
> > > > > > > > > > >
> > > > > > > > > > > keep it
> > > > > > > > > > > > > > > alive
> > > > > > > > > > > > > > > > > > > > regardless
> > > > > > > > > > > > > > > > > > > > > > the data it contains. If we want
> to
> > > > give
> > > > > it a
> > > > > > > > > > >
> > > > > > > > > > > chance, then
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > condition
> > > > > > > > > > > > > > > > > > > > > > (long fsync etc.) should not
> > > considered
> > > > > as
> > > > > > > >
> > > > > > > > critical
> > > > > > > > > > >
> > > > > > > > > > > at all.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Sat, 8 Sep 2018 at 15:18,
> Yakov
> > > > > Zhdanov <
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [hidden email]>:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to
> have
> > > an
> > > > > > > >
> > > > > > > > opporunity
> > > > > > > > > > >
> > > > > > > > > > > set backups
> > > > > > > > > > > > > > > > count
> > > > > > > > > > > > > > > > > > > > > threshold
> > > > > > > > > > > > > > > > > > > > > > > (at runtime also!) that will
> not
> > > > allow
> > > > > any
> > > > > > > > > > >
> > > > > > > > > > > automatic stop
> > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > > > there
> > > > > > > > > > > > > > > > > > > > > will be
> > > > > > > > > > > > > > > > > > > > > > > a data loss. Andrey, what do
> you
> > > > think?
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Best Regards, Vyacheslav D.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > --
> > > > > > > > > > Maxim Muzafarov
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > >
> > > >
> > >
> > --
> > --
> > Maxim Muzafarov
> >
>
>
> --
> Best regards,
> Andrey Kuznetsov.
>

Alexey Goncharuk

Re: Critical worker threads liveness checking drawbacks

Folks, why didn't we include IGNITE-10003 in the Ignite 2.7 release scope?
The omission causes an Ignite node to be stopped by default when checkpoint
read lock acquisition times out. I expect a lot of Ignite 2.7 users will be
affected by this mistake.

We should at least update the documentation and make users aware of a
workaround.
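
For illustration, the behaviour described in the quoted message below can be modelled in a few lines of plain Java. This is only a sketch and not Ignite code: `FailureKind`, `StopNodeHandler` and `shouldStopNode` are hypothetical names. The only things taken from this thread are the failure type names CRITICAL_ERROR and SYSTEM_WORKER_BLOCKED and the proposed CHECKPOINT_READ_LOCK_BLOCKED. The sketch assumes a handler that stops the node unless the failure type is in its ignored set.

```java
import java.util.EnumSet;

public class FailureHandlingSketch {
    /** Stand-in for the failure types named in this thread (not Ignite's own enum). */
    enum FailureKind { CRITICAL_ERROR, SYSTEM_WORKER_BLOCKED, CHECKPOINT_READ_LOCK_BLOCKED }

    /** Hypothetical handler: stops the node unless the failure type is ignored. */
    static final class StopNodeHandler {
        private final EnumSet<FailureKind> ignored;

        StopNodeHandler(EnumSet<FailureKind> ignored) {
            // Defensive copy so later changes by the caller do not affect the handler.
            this.ignored = EnumSet.copyOf(ignored);
        }

        /** Returns true when this failure would stop the local node. */
        boolean shouldStopNode(FailureKind kind) {
            return !ignored.contains(kind);
        }
    }

    public static void main(String[] args) {
        // Assumed default: only SYSTEM_WORKER_BLOCKED is ignored.
        StopNodeHandler dflt =
            new StopNodeHandler(EnumSet.of(FailureKind.SYSTEM_WORKER_BLOCKED));

        // Lock timeout reported as CRITICAL_ERROR: the node is stopped.
        System.out.println(dflt.shouldStopNode(FailureKind.CRITICAL_ERROR));

        // Same timeout reported as SYSTEM_WORKER_BLOCKED: the node keeps running.
        System.out.println(dflt.shouldStopNode(FailureKind.SYSTEM_WORKER_BLOCKED));

        // Dedicated type, explicitly added to the ignored set by the user.
        StopNodeHandler custom = new StopNodeHandler(EnumSet.of(
            FailureKind.SYSTEM_WORKER_BLOCKED, FailureKind.CHECKPOINT_READ_LOCK_BLOCKED));
        System.out.println(custom.shouldStopNode(FailureKind.CHECKPOINT_READ_LOCK_BLOCKED));
    }
}
```

In this model the outcome depends only on which failure type the timeout is reported under and whether that type is in the ignored set, which is the choice the quoted message asks about.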

Thu, 25 Oct 2018 at 16:35, Alexey Goncharuk <[hidden email]>:

> Andrey,
>
> I still see that checkpoint read lock acquisition raises a CRITICAL_ERROR,
> which by default will shut down the local node. As far as I remember, we
> decided that by default thread timeout should not trigger node failure.
> Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events in
> the default configuration.
>
> Should we introduce another critical failure type
> CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for checkpoint
> read lock acquire failure?
>
> --AG
>
> Fri, 12 Oct 2018 at 8:29, Andrey Kuznetsov <[hidden email]>:
>
>> Igniters,
>>
>> Now I spot blocking / long-running code arising from
>> {{GridDhtPartitionsExchangeFuture#init}} calls in the partition-exchanger
>> thread, see [1]. Ideally, all blocking operations along all possible code
>> paths should be guarded implicitly from the critical failure detector to
>> keep the thread from being considered blocked. There is a pull request [2]
>> that provides a shallow solution. I didn't change code outside
>> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
>> upcoming change. Also, I didn't touch the code runnable by threads other
>> than partition-exchanger. So I have a number of guarded sections that are
>> wider than they could be, and this potentially hides issues from the failure
>> detector. Does this PR make sense? Or maybe it's better to exclude
>> partition-exchanger from the critical threads registry altogether?
>>
>> [1] https://issues.apache.org/jira/browse/IGNITE-9710
>> [2] https://github.com/apache/ignite/pull/4962
>>
>>
>> Fri, 28 Sep 2018 at 18:56, Maxim Muzafarov <[hidden email]>:
>>
>> > Andrey, Andrey
>> >
>> > > Thanks for being attentive! It's definitely a typo. Could you please
>> > > create an issue?
>> >
>> > I've created an issue [1] and prepared PR [2].
>> > Please, review this change.
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
>> > [2] https://github.com/apache/ignite/pull/4862
>> >
>> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <[hidden email]> wrote:
>> >
>> > > Config option + mbean access. Does that make sense?
>> > >
>> > > Yakov
>> > >
>> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <[hidden email]> wrote:
>> > >
>> > > > Then it should be config option.
>> > > >
>> > > > Fri, 28 Sep 2018 at 13:15, Andrey Gura <[hidden email]>:
>> > > >
>> > > > > Guys,
>> > > > >
>> > > > > why do we need both a config option and a system property? I believe
>> > > > > one way is enough.
>> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[hidden email]>
>> > > > > wrote:
>> > > > > >
>> > > > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
>> > > > > >
>> > > > > > Fixed version is 2.7.
>> > > > > >
>> > > > > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
>> > > > > > > Nikolay, I agree, a user should be able to disable both thread
>> > > > > > > liveness check and checkpoint read lock timeout check from config
>> > > > > > > and a system property.
>> > > > > > >
>> > > > > > > Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[hidden email]>:
>> > > > > > >
>> > > > > > > > Hello, Igniters.
>> > > > > > > >
>> > > > > > > > I found that this feature can't be disabled from config.
>> > > > > > > > The only way to disable it is via a JMX bean.
>> > > > > > > >
>> > > > > > > > I think this is very dangerous: if there is some corner case or
>> > > > > > > > a bug in this watchdog, it can make Ignite unusable.
>> > > > > > > > I propose implementing the ability to disable this feature both
>> > > > > > > > from config and from JVM options.
>> > > > > > > >
>> > > > > > > > What do you think?
>> > > > > > > >
>> > > > > > > > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
>> > > > > > > > > Maxim,
>> > > > > > > > >
>> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could you
>> > > > > > > > > please create an issue?
>> > > > > > > > >
>> > > > > > > > > Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[hidden email]>:
>> > > > > > > > >
>> > > > > > > > > > Folks,
>> > > > > > > > > >
>> > > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1]
>> > > > > > > > > > (master branch) exchange future wrapped with double
>> > > > > > > > > > `blockingSectionEnd` method. Is it correct? I just want to
>> > > > > > > > > > understand this change and how should I use this in the
>> > > > > > > > > > future.
>> > > > > > > > > >
>> > > > > > > > > > Should I file a new issue to fix this? I think here
>> > > > > > > > > > `blockingSectionBegin` method should be used.
>> > > > > > > > > >
>> > > > > > > > > > -------------
>> > > > > > > > > > blockingSectionEnd();
>> > > > > > > > > >
>> > > > > > > > > > try {
>> > > > > > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
>> > > > > > > > > > } finally {
>> > > > > > > > > > blockingSectionEnd();
>> > > > > > > > > > }
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > [1]
>> > > > > > > > > >
>> > > > > > > > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
>> > > > > > > > > >
>> > > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Andrey Gura, thank you for the answer!
>> > > > > > > > > > >
>> > > > > > > > > > > I agree that wrapping of 'init' method reduces the profit of watchdog
>> > > > > > > > > > > service in case of PME worker, but in other cases, we should wrap all
>> > > > > > > > > > > possible long sections on GridDhtPartitionExchangeFuture. For example
>> > > > > > > > > > > 'onCacheChangeRequest' method or
>> > > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside because it may take
>> > > > > > > > > > > significant time (reproducer attached).
>> > > > > > > > > > >
>> > > > > > > > > > > I only want to point out a possible issue which may allow to end-user
>> > > > > > > > > > > halt the Ignite cluster accidentally.
>> > > > > > > > > > >
>> > > > > > > > > > > I'm sure that PME experts know how to fix this issue properly.
>> > > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > Vyacheslav,
>> > > > > > > > > > > >
>> > > > > > > > > > > > Exchange worker is strongly tied with
>> > > > > > > > > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker also
>> > > > > > > > > > > > shouldn't be blocked for a long time, but in reality it happens. It also
>> > > > > > > > > > > > means that your change doesn't make sense.
>> > > > > > > > > > > >
>> > > > > > > > > > > > What actually makes sense is identifying the places that block
>> > > > > > > > > > > > intentionally. Maybe some places/actions should be braced by
>> > > > > > > > > > > > blocking guards.
>> > > > > > > > > > > >
>> > > > > > > > > > > > If you have failing tests please make sure that your failureHandler is
>> > > > > > > > > > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
>> > > > > > > > > > > > [CRITICAL_WORKER_BLOCKED].
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <[hidden email]> wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Hi Igniters!
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Thank you for this important improvement!
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I've looked through the implementation and noticed that
>> > > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a blocked
>> > > > > > > > > > > > > section. This means it is easy to halt the node in case of long-running
>> > > > > > > > > > > > > actions during PME, for example when we create a cache with a
>> > > > > > > > > > > > > StoreFactory that connects to a 3rd-party DB.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I'm not sure that it is the right behavior.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I filed the issue [1] and prepared the PR [2] with a reproducer and
>> > > > > > > > > > > > > a possible fix.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Andrey, could you please look at it and confirm that it makes sense?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
>> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
>> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <[hidden email]> wrote:
>> > > > > > > > > > > > > > Denis,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I've created the ticket [1] with a short description of the
>> > > > > > > > > > > > > > functionality.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]>:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Andrey K. and G.,
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Thanks, do we have a documentation ticket created? Prachi (copied)
>> > > > > > > > > > > > > > > can help with the documentation.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > Denis
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <[hidden email]> wrote:
>> > > > > > > > > > > > > > > > Andrey,
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > finally your change is merged to the master branch. Congratulations
>> > > > > > > > > > > > > > > > and thank you very much! :)
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I think that the next step is a feature that will allow signaling
>> > > > > > > > > > > > > > > > blocked threads to the monitoring tools via MXBean.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I hope you will continue development of this feature and provide
>> > > > > > > > > > > > > > > > your vision in a new JIRA issue.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <[hidden email]>
>> > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > David, Maxim!
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks a lot for you ideas.
>> Unfortunately, I
>> > > > can't
>> > > > > adopt
>> > > > > > > >
>> > > > > > > > all
>> > > > > > > > > > >
>> > > > > > > > > > > of them
>> > > > > > > > > > > > > > > > right
>> > > > > > > > > > > > > > > > > now: the scope is much broader than the
>> scope
>> > > of
>> > > > > the
>> > > > > > > >
>> > > > > > > > change I
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > implement.
>> > > > > > > > > > > > > > > > I
>> > > > > > > > > > > > > > > > > have had a talk to a group of Ignite
>> > commiters,
>> > > > > and we
>> > > > > > > >
>> > > > > > > > agreed
>> > > > > > > > > > >
>> > > > > > > > > > > to
>> > > > > > > > > > > > > > > complete
>> > > > > > > > > > > > > > > > > the change as follows.
>> > > > > > > > > > > > > > > > > - Blocking instructions in system-critical
>> > > which
>> > > > > may
>> > > > > > > > > >
>> > > > > > > > > > resonably
>> > > > > > > > > > > last
>> > > > > > > > > > > > > > > long
>> > > > > > > > > > > > > > > > > should be explicitly excluded from the
>> > > > monitoring.
>> > > > > > > > > > > > > > > > > - Failure handlers should have a setting
>> to
>> > > > > suppress some
>> > > > > > > > > > >
>> > > > > > > > > > > failures on
>> > > > > > > > > > > > > > > > > per-failure-type basis.
>> > > > > > > > > > > > > > > > > According to this I have updated the
>> > > > > implementation: [1]
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > [1]
>> > https://github.com/apache/ignite/pull/4089
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Mon, 10 Sep 2018 at 22:35, David
>> Harvey <
>> > > > > > > > > > >
>> > > > > > > > > > > [hidden email]>:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > When I've done this before,I've needed
>> to
>> > > find
>> > > > > the
>> > > > > > > >
>> > > > > > > > oldest
>> > > > > > > > > > >
>> > > > > > > > > > > thread,
>> > > > > > > > > > > > > > > and
>> > > > > > > > > > > > > > > > kill
>> > > > > > > > > > > > > > > > > > the node running that. From a language
>> > > > > standpoint,
>> > > > > > > > > >
>> > > > > > > > > > Maxim's
>> > > > > > > > > > > "without
>> > > > > > > > > > > > > > > > > > progress" better than "heartbeat". For
>> > > > > example, what
>> > > > > > > >
>> > > > > > > > I'm
>> > > > > > > > > > >
>> > > > > > > > > > > most
>> > > > > > > > > > > > > > > > interested
>> > > > > > > > > > > > > > > > > > in on a distributed system is which
>> thread
>> > > > > started the
>> > > > > > > >
>> > > > > > > > work
>> > > > > > > > > > >
>> > > > > > > > > > > it has
>> > > > > > > > > > > > > > > not
>> > > > > > > > > > > > > > > > > > completed the earliest, and when did
>> that
>> > > > thread
>> > > > > last
>> > > > > > > >
>> > > > > > > > make
>> > > > > > > > > > >
>> > > > > > > > > > > forward
>> > > > > > > > > > > > > > > > > > process. You don't want to kill a
>> node
>> > > > > because a
>> > > > > > > >
>> > > > > > > > thread
>> > > > > > > > > > >
>> > > > > > > > > > > is
>> > > > > > > > > > > > > > > waiting
>> > > > > > > > > > > > > > > > on a
>> > > > > > > > > > > > > > > > > > lock held by a thread that went off-node
>> > and
>> > > > has
>> > > > > not
>> > > > > > > > > >
>> > > > > > > > > > gotten a
>> > > > > > > > > > > > > > > response.
>> > > > > > > > > > > > > > > > > > If you don't understand the dependency
>> > > > > relationships,
>> > > > > > > >
>> > > > > > > > you
>> > > > > > > > > > >
>> > > > > > > > > > > will make
>> > > > > > > > > > > > > > > > > > incorrect recovery decisions.
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim
>> > > > Muzafarov <
>> > > > > > > > > > >
>> > > > > > > > > > > [hidden email]>
>> > > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > I think we should find exact answers
>> to
>> > > these
>> > > > > > > >
>> > > > > > > > questions:
>> > > > > > > > > > > > > > > > > > > 1. What `critical` issue exactly is?
>> > > > > > > > > > > > > > > > > > > 2. How can we find critical issues?
>> > > > > > > > > > > > > > > > > > > 3. How can we handle critical issues?
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > First,
>> > > > > > > > > > > > > > > > > > > - Ignore uninterruptable actions
>> (e.g.
>> > > > > > > >
>> > > > > > > > worker\service
>> > > > > > > > > > >
>> > > > > > > > > > > shutdown)
>> > > > > > > > > > > > > > > > > > > - Long I/O operations (should be a
>> > > > > configurable
>> > > > > > > >
>> > > > > > > > timeout
>> > > > > > > > > > >
>> > > > > > > > > > > for each
>> > > > > > > > > > > > > > > > type of
>> > > > > > > > > > > > > > > > > > > usage)
>> > > > > > > > > > > > > > > > > > > - Infinite loops
>> > > > > > > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or
>> too
>> > > > many
>> > > > > parked
>> > > > > > > > > > >
>> > > > > > > > > > > threads,
>> > > > > > > > > > > > > > > > exclude
>> > > > > > > > > > > > > > > > > > I/O)
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Second,
>> > > > > > > > > > > > > > > > > > > - The working queue is without
>> progress
>> > > > (e.g.
>> > > > > disco,
>> > > > > > > > > > >
>> > > > > > > > > > > exchange
>> > > > > > > > > > > > > > > > queues)
>> > > > > > > > > > > > > > > > > > > - Work hasn't been completed since
>> the
>> > > last
>> > > > > > > >
>> > > > > > > > heartbeat
>> > > > > > > > > > >
>> > > > > > > > > > > (checking
>> > > > > > > > > > > > > > > > > > > milestones)
>> > > > > > > > > > > > > > > > > > > - Too many system resources used by a
>> > > thread
>> > > > > for the
>> > > > > > > > > >
>> > > > > > > > > > long
>> > > > > > > > > > > period
>> > > > > > > > > > > > > > > of
>> > > > > > > > > > > > > > > > time
>> > > > > > > > > > > > > > > > > > > (allocated memory, CPU)
>> > > > > > > > > > > > > > > > > > > - Timing fields associated with each
>> > > thread
>> > > > > status
>> > > > > > > > > > >
>> > > > > > > > > > > exceeded a
>> > > > > > > > > > > > > > > > maximum
>> > > > > > > > > > > > > > > > > > time
>> > > > > > > > > > > > > > > > > > > limit.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Third (not too many options here),
>> > > > > > > > > > > > > > > > > > > - `log everything` should be the
>> default
>> > > > > behaviour
>> > > > > > > >
>> > > > > > > > in
>> > > > > > > > > >
>> > > > > > > > > > all
>> > > > > > > > > > > these
>> > > > > > > > > > > > > > > > cases,
>> > > > > > > > > > > > > > > > > > > since it may be difficult to find the
>> > cause
>> > > > > after the
>> > > > > > > > > > >
>> > > > > > > > > > > restart.
>> > > > > > > > > > > > > > > > > > > - Wait some interval of time and kill
>> > the
>> > > > > hanging
>> > > > > > > >
>> > > > > > > > node
>> > > > > > > > > > >
>> > > > > > > > > > > (cluster
>> > > > > > > > > > > > > > > > should
>> > > > > > > > > > > > > > > > > > be
>> > > > > > > > > > > > > > > > > > > configured stable enough)
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Questions,
>> > > > > > > > > > > > > > > > > > > - Not sure, but can workers miss
>> their
>> > > > > heartbeat
>> > > > > > > > > > >
>> > > > > > > > > > > deadlines if CPU
>> > > > > > > > > > > > > > > > loads
>> > > > > > > > > > > > > > > > > > up
>> > > > > > > > > > > > > > > > > > > to 80%-90%? Bursts of momentary
>> overloads
>> > > can
>> > > > > be
>> > > > > > > > > > > > > > > > > > > expected behaviour as a normal
>> part
>> > of
>> > > > > system
>> > > > > > > > > > >
>> > > > > > > > > > > operations.
>> > > > > > > > > > > > > > > > > > > - Why do we decide that critical
>> thread
>> > > > should
>> > > > > > > >
>> > > > > > > > monitor
>> > > > > > > > > > >
>> > > > > > > > > > > each other?
>> > > > > > > > > > > > > > > > For
>> > > > > > > > > > > > > > > > > > > instance, if all the tasks were
>> blocked
>> > and
>> > > > > unable to
>> > > > > > > > > >
>> > > > > > > > > > run,
>> > > > > > > > > > > > > > > > > > > node reset would never occur. As
>> for
>> > > me,
>> > > > a
>> > > > > better
>> > > > > > > > > > >
>> > > > > > > > > > > solution is
>> > > > > > > > > > > > > > > to
>> > > > > > > > > > > > > > > > use
>> > > > > > > > > > > > > > > > > > a
>> > > > > > > > > > > > > > > > > > > separate monitor thread or pool (maybe
>> > both
>> > > > > with
>> > > > > > > >
>> > > > > > > > software
>> > > > > > > > > > > > > > > > > > > and hardware checks) that not only
>> > > checks
>> > > > > > > >
>> > > > > > > > heartbeats
>> > > > > > > > > > >
>> > > > > > > > > > > but
>> > > > > > > > > > > > > > > > monitors the
>> > > > > > > > > > > > > > > > > > > other system as well.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David
>> > Harvey <
>> > > > > > > > > > >
>> > > > > > > > > > > [hidden email]>
>> > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > It would be safer to restart the
>> entire
>> > > > > cluster
>> > > > > > > >
>> > > > > > > > than to
>> > > > > > > > > > >
>> > > > > > > > > > > remove
>> > > > > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > last
>> > > > > > > > > > > > > > > > > > > > node for a cache that should be
>> > > redundant.
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey
>> > Gura
>> > > <
>> > > > > > > > > > >
>> > > > > > > > > > > [hidden email]>
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Hi,
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can
>> > provide
>> > > > some
>> > > > > > > >
>> > > > > > > > option
>> > > > > > > > > > >
>> > > > > > > > > > > that manage
>> > > > > > > > > > > > > > > > worker
>> > > > > > > > > > > > > > > > > > > > > liveness checker behavior in case
>> of
>> > > > > observing
>> > > > > > > >
>> > > > > > > > that
>> > > > > > > > > > >
>> > > > > > > > > > > some worker
>> > > > > > > > > > > > > > > > is
>> > > > > > > > > > > > > > > > > > > > > blocked too long.
>> > > > > > > > > > > > > > > > > > > > > At least it will some workaround
>> for
>> > > > > cases when
>> > > > > > > >
>> > > > > > > > node
>> > > > > > > > > > >
>> > > > > > > > > > > fails is
>> > > > > > > > > > > > > > > > too
>> > > > > > > > > > > > > > > > > > > > > annoying.
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Backups count threshold sounds
>> good
>> > > but I
>> > > > > don't
>> > > > > > > > > > >
>> > > > > > > > > > > understand how
>> > > > > > > > > > > > > > > it
>> > > > > > > > > > > > > > > > > > will
>> > > > > > > > > > > > > > > > > > > > > help in case of cluster hanging.
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > The simplest solution here is
>> alert
>> > in
>> > > > > cases of
>> > > > > > > > > > >
>> > > > > > > > > > > blocking of
>> > > > > > > > > > > > > > > some
>> > > > > > > > > > > > > > > > > > > > > critical worker (we can improve
>> > > > > WorkersRegistry
>> > > > > > > >
>> > > > > > > > for
>> > > > > > > > > > >
>> > > > > > > > > > > this
>> > > > > > > > > > > > > > > purpose
>> > > > > > > > > > > > > > > > and
>> > > > > > > > > > > > > > > > > > > > > expose list of blocked workers)
>> and
>> > > > > optionally
>> > > > > > > >
>> > > > > > > > call
>> > > > > > > > > > >
>> > > > > > > > > > > system
>> > > > > > > > > > > > > > > > configured
>> > > > > > > > > > > > > > > > > > > > > failure processor. BTW, failure
>> > > processor
>> > > > > can be
>> > > > > > > > > > >
>> > > > > > > > > > > extended in
>> > > > > > > > > > > > > > > > order to
>> > > > > > > > > > > > > > > > > > > > > perform any checks (e.g. backup
>> > count)
>> > > > and
>> > > > > decide
>> > > > > > > > > > >
>> > > > > > > > > > > whether it
>> > > > > > > > > > > > > > > > should
>> > > > > > > > > > > > > > > > > > > > > stop node or not.
>> > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM
>> Andrey
>> > > > > Kuznetsov <
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > [hidden email]>
>> > > > > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness checks deal with
>> > > > > > > > > > > > > > > > > > > > > > _critical_ conditions, i.e. when such a condition is met we consider the
>> > > > > > > > > > > > > > > > > > > > > > node totally broken, and there is no sense in keeping it alive regardless
>> > > > > > > > > > > > > > > > > > > > > > of the data it contains. If we want to give it a chance, then the condition
>> > > > > > > > > > > > > > > > > > > > > > (long fsync etc.) should not be considered critical at all.
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <[hidden email]> wrote:
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a backups count
>> > > > > > > > > > > > > > > > > > > > > > > threshold (at runtime also!) that will not allow any automatic stop if
>> > > > > > > > > > > > > > > > > > > > > > > there will be a data loss. Andrey, what do you think?
>> > > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > > --Yakov
>> > > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > > > > > > Best regards,
>> > > > > > > > > > > > > > > > > > > > > > Andrey Kuznetsov.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > > > Maxim Muzafarov
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > Best regards,
>> > > > > > > > > > > > > > > > > Andrey Kuznetsov.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > Best regards,
>> > > > > > > > > > > > > > Andrey Kuznetsov.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > --
>> > > > > > > > > > > > > Best Regards, Vyacheslav D.
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > --
>> > > > > > > > > > > Best Regards, Vyacheslav D.
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > --
>> > > > > > > > > > --
>> > > > > > > > > > Maxim Muzafarov
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > >
>> > > >
>> > >
>> > --
>> > --
>> > Maxim Muzafarov
>> >
>>
>>
>> --
>> Best regards,
>> Andrey Kuznetsov.
>>
>

Nikolay Izhikov-2

Re: Critical worker threads liveness checking drawbacks

Hello, Alexey.

No, we didn't include this ticket in 2.7.
Should we?

On Wed, Dec 19, 2018 at 12:55, Alexey Goncharuk <[hidden email]> wrote:

> Folks, why didn't we include IGNITE-10003 in the ignite-2.7 release scope?
> This causes an Ignite node to be stopped by default when checkpoint read
> lock acquisition times out. I expect a lot of Ignite 2.7 users will be
> affected by this mistake.
>
> We should at least update the documentation and make users aware of a
> workaround.
>
> On Thu, Oct 25, 2018 at 16:35, Alexey Goncharuk <[hidden email]> wrote:
>
> > Andrey,
> >
> > I still see that checkpoint read lock acquisition raises a CRITICAL_ERROR,
> > which by default will shut down the local node. As far as I remember, we
> > decided that by default a thread timeout should not trigger node failure.
> > Now, however, it does, because the default configuration ignores only
> > SYSTEM_WORKER_BLOCKED events.
> >
> > Should we introduce another critical failure type,
> > CHECKPOINT_READ_LOCK_BLOCKED, or use SYSTEM_WORKER_BLOCKED for checkpoint
> > read lock acquire failure?
> >
> > --AG
> >
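
The distinction raised in the message above can be illustrated with a minimal, self-contained sketch of a failure handler that suppresses failures on a per-type basis. This is not Ignite's implementation: `FailureKind` and `SketchFailureHandler` are hypothetical names, and the enum values merely mirror the types mentioned in this thread. The sketch shows why a checkpoint read lock timeout reported as CRITICAL_ERROR still stops the node when only SYSTEM_WORKER_BLOCKED is ignored.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for a failure type enum; values mirror names used in this thread.
enum FailureKind { CRITICAL_ERROR, SYSTEM_WORKER_BLOCKED, CHECKPOINT_READ_LOCK_BLOCKED }

// Hypothetical handler: decides whether a reported failure should stop the node.
class SketchFailureHandler {
    private final Set<FailureKind> ignored = EnumSet.noneOf(FailureKind.class);

    SketchFailureHandler(Set<FailureKind> ignoredKinds) {
        ignored.addAll(ignoredKinds);
    }

    /** Returns true when the node should be stopped, false when the failure is only logged. */
    boolean shouldStopNode(FailureKind kind) {
        return !ignored.contains(kind);
    }
}
```

With only SYSTEM_WORKER_BLOCKED in the ignored set, a timeout reported as CRITICAL_ERROR returns true from `shouldStopNode`. Reporting it under an ignored type, either a dedicated one or SYSTEM_WORKER_BLOCKED itself, would let the node survive.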
> > On Fri, Oct 12, 2018 at 8:29, Andrey Kuznetsov <[hidden email]> wrote:
> >
> >> Igniters,
> >>
> >> Now I spot blocking / long-running code arising from
> >> {{GridDhtPartitionsExchangeFuture#init}} calls in the partition-exchanger
> >> thread, see [1]. Ideally, all blocking operations along all possible code
> >> paths should be guarded implicitly from the critical failure detector to
> >> keep the thread from being considered blocked. There is a pull request [2]
> >> that provides a shallow solution. I didn't change code outside
> >> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
> >> upcoming change. Also, I didn't touch the code runnable by threads other
> >> than partition-exchanger. So I have a number of guarded sections that are
> >> wider than they could be, and this potentially hides issues from the failure
> >> detector. Does this PR make sense? Or maybe it's better to exclude
> >> partition-exchanger from the critical threads registry altogether?
> >>
> >> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> >> [2] https://github.com/apache/ignite/pull/4962
> >>
> >>
> >> On Fri, Sep 28, 2018 at 18:56, Maxim Muzafarov <[hidden email]> wrote:
> >>
> >> > Andrey, Andrey
> >> >
> >> > > Thanks for being attentive! It's definitely a typo. Could you please
> >> > > create an issue?
> >> >
> >> > I've created an issue [1] and prepared PR [2].
> >> > Please, review this change.
> >> >
> >> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
> >> > [2] https://github.com/apache/ignite/pull/4862
> >> >
> >> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <[hidden email]> wrote:
> >> >
> >> > > Config option + mbean access. Does that make sense?
> >> > >
> >> > > Yakov
> >> > >
> >> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <[hidden email]> wrote:
> >> > >
> >> > > > Then it should be config option.
> >> > > >
> >> > > > On Fri, Sep 28, 2018 at 13:15, Andrey Gura <[hidden email]> wrote:
> >> > > >
> >> > > > > Guys,
> >> > > > >
> >> > > > > why do we need both a config option and a system property? I believe
> >> > > > > one way is enough.
> >> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[hidden email]>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> >> > > > > >
> >> > > > > > Fixed version is 2.7.
> >> > > > > >
> >> > > > > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
> >> > > > > > > Nikolay, I agree, a user should be able to disable both the thread
> >> > > > > > > liveness check and the checkpoint read lock timeout check from
> >> > > > > > > config and from a system property.
> >> > > > > > >
> >> > > > > > > On Fri, Sep 28, 2018 at 11:30, Nikolay Izhikov <[hidden email]> wrote:
> >> > > > > > >
> >> > > > > > > > Hello, Igniters.
> >> > > > > > > >
> >> > > > > > > > I found that this feature can't be disabled from config.
> >> > > > > > > > The only way to disable it is via a JMX bean.
> >> > > > > > > >
> >> > > > > > > > I think it is very dangerous: if we have some corner case or a bug
> >> > > > > > > > in this watchdog, it can make Ignite unusable.
> >> > > > > > > > I propose to implement the possibility to disable this feature both
> >> > > > > > > > from config and from JVM options.
> >> > > > > > > >
> >> > > > > > > > What do you think?
> >> > > > > > > >
> >> > > > > > > > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> >> > > > > > > > > Maxim,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could you
> >> > > > > > > > > please create an issue?
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Sep 27, 2018 at 16:00, Maxim Muzafarov <[hidden email]> wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Folks,
> >> > > > > > > > > >
> >> > > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master
> >> > > > > > > > > > branch) that the exchange future is wrapped with a double
> >> > > > > > > > > > `blockingSectionEnd` call. Is it correct? I just want to
> >> > > > > > > > > > understand this change and how I should use this in the future.
> >> > > > > > > > > >
> >> > > > > > > > > > Should I file a new issue to fix this? I think the
> >> > > > > > > > > > `blockingSectionBegin` method should be used here.
> >> > > > > > > > > >
> >> > > > > > > > > > -------------
> >> > > > > > > > > > blockingSectionEnd();
> >> > > > > > > > > >
> >> > > > > > > > > > try {
> >> > > > > > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> >> > > > > > > > > > } finally {
> >> > > > > > > > > > blockingSectionEnd();
> >> > > > > > > > > > }
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > [1]
> >> > > > > > > > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> >> > > > > > > > > >
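
For reference, the begin/end pairing that the quoted snippet was meant to follow can be sketched as below. This is a self-contained model, not Ignite code: `SketchWorker` and its `guarded` helper are hypothetical, and marking a blocking section by setting the heartbeat timestamp to `Long.MAX_VALUE` is an assumption made for this sketch. Only the method names `blockingSectionBegin` and `blockingSectionEnd` come from the thread.

```java
import java.util.function.Supplier;

// Hypothetical worker model: a watchdog would read heartbeatTs to detect stalls.
class SketchWorker {
    // Assumption of this sketch: Long.MAX_VALUE means "inside a blocking section",
    // so the worker can never look stale while it is intentionally blocked.
    private volatile long heartbeatTs = System.currentTimeMillis();

    void blockingSectionBegin() { heartbeatTs = Long.MAX_VALUE; }

    void blockingSectionEnd() { heartbeatTs = System.currentTimeMillis(); }

    boolean inBlockingSection() { return heartbeatTs == Long.MAX_VALUE; }

    /** Runs a potentially long operation between a matching begin/end pair. */
    <T> T guarded(Supplier<T> op) {
        blockingSectionBegin();          // begin BEFORE the blocking call, not end
        try {
            return op.get();
        } finally {
            blockingSectionEnd();        // always restore liveness tracking
        }
    }
}
```

The `finally` block restores liveness tracking even when the guarded call throws, which is why the begin call has to sit outside the `try`.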
> >> > > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Andrey Gura, thank you for the answer!
> >> > > > > > > > > > >
> >> > > > > > > > > > > I agree that wrapping the 'init' method reduces the benefit of the
> >> > > > > > > > > > > watchdog service in the case of the PME worker, but in other cases we
> >> > > > > > > > > > > should wrap all possibly long sections in
> >> > > > > > > > > > > GridDhtPartitionExchangeFuture, for example the
> >> > > > > > > > > > > 'onCacheChangeRequest' method or
> >> > > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside it, because it may take
> >> > > > > > > > > > > significant time (reproducer attached).
> >> > > > > > > > > > >
> >> > > > > > > > > > > I only want to point out a possible issue which may allow an end user
> >> > > > > > > > > > > to halt the Ignite cluster accidentally.
> >> > > > > > > > > > >
> >> > > > > > > > > > > I'm sure that PME experts know how to fix this issue properly.
> >> > > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]> wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Vyacheslav,
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Exchange worker is strongly tied to
> >> > > > > > > > > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker also
> >> > > > > > > > > > > > shouldn't be blocked for a long time, but in reality it happens. It also
> >> > > > > > > > > > > > means that your change doesn't make sense.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > What actually makes sense is identifying the places that block
> >> > > > > > > > > > > > intentionally. Maybe some places/actions should be braced by
> >> > > > > > > > > > > > blocking guards.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > If you have failing tests, please make sure that your failureHandler is
> >> > > > > > > > > > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> >> > > > > > > > > > > > [CRITICAL_WORKER_BLOCKED].
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <[hidden email]>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Hi Igniters!
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Thank you for this important improvement!
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I've looked through the implementation and noticed that
> >> > > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> >> > > > > > > > > > > > > blocking section. This means it is easy to halt the node in case of
> >> > > > > > > > > > > > > long-running actions during PME, for example when we create a cache
> >> > > > > > > > > > > > > with a StoreFactory which connects to a 3rd party DB.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I'm not sure that it is the right behavior.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I filed the issue [1] and prepared the PR [2] with a reproducer and
> >> > > > > > > > > > > > > a possible fix.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Andrey, could you please take a look and confirm that it makes sense?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> >> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> >> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <[hidden email]>
> >> > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Denis,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I've created the ticket [1] with a short description of the
> >> > > > > > > > > > > > > > functionality.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 17:46, Denis Magda <[hidden email]> wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Andrey K. and G.,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Thanks, do we have a documentation ticket created? Prachi (copied)
> >> > > > > > > > > > > > > > > can help with the documentation.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > Denis
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <[hidden email]> wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Andrey,
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > finally your change is merged to the master branch. Congratulations
> >> > > > > > > > > > > > > > > > and thank you very much! :)
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > I think that the next step is a feature that will allow signaling
> >> > > > > > > > > > > > > > > > about blocked threads to the monitoring tools via MXBean.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > I hope you will continue development of this feature and provide
> >> > > > > > > > > > > > > > > > your vision in a new JIRA issue.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <[hidden email]>
> >> > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > David, Maxim!
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt all of them
> >> > > > > > > > > > > > > > > > > right now: the scope is much broader than the scope of the change I
> >> > > > > > > > > > > > > > > > > implement. I have had a talk with a group of Ignite committers, and we
> >> > > > > > > > > > > > > > > > > agreed to complete the change as follows.
> >> > > > > > > > > > > > > > > > > - Blocking instructions in system-critical threads which may
> >> > > > > > > > > > > > > > > > > reasonably last long should be explicitly excluded from the monitoring.
> >> > > > > > > > > > > > > > > > > - Failure handlers should have a setting to suppress some failures on
> >> > > > > > > > > > > > > > > > > a per-failure-type basis.
> >> > > > > > > > > > > > > > > > > According to this I have updated the implementation: [1]
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 22:35, David Harvey <[hidden email]> wrote:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > When I've done this before, I've needed to find the oldest thread and
> >> > > > > > > > > > > > > > > > > > kill the node running that. From a language standpoint, Maxim's
> >> > > > > > > > > > > > > > > > > > "without progress" is better than "heartbeat". For example, what I'm
> >> > > > > > > > > > > > > > > > > > most interested in on a distributed system is which thread started the
> >> > > > > > > > > > > > > > > > > > work it has not completed the earliest, and when that thread last made
> >> > > > > > > > > > > > > > > > > > forward progress. You don't want to kill a node because a thread is
> >> > > > > > > > > > > > > > > > > > waiting on a lock held by a thread that went off-node and has not
> >> > > > > > > > > > > > > > > > > > gotten a response. If you don't understand the dependency
> >> > > > > > > > > > > > > > > > > > relationships, you will make incorrect recovery decisions.
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <[hidden email]>
> >> > > > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > I think we should find exact answers to these questions:
> >> > > > > > > > > > > > > > > > > > > 1. What exactly is a `critical` issue?
> >> > > > > > > > > > > > > > > > > > > 2. How can we find critical issues?
> >> > > > > > > > > > > > > > > > > > > 3. How can we handle critical issues?
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > First,
> >> > > > > > > > > > > > > > > > > > > - Ignore uninterruptable actions (e.g. worker\service shutdown)
> >> > > > > > > > > > > > > > > > > > > - Long I/O operations (should be a configurable timeout for each
> >> > > > > > > > > > > > > > > > > > > type of usage)
> >> > > > > > > > > > > > > > > > > > > - Infinite loops
> >> > > > > > > > > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked threads,
> >> > > > > > > > > > > > > > > > > > > exclude I/O)
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Second,
> >> > > > > > > > > > > > > > > > > > > - The working queue is without progress (e.g. disco, exchange
> >> > > > > > > > > > > > > > > > > > > queues)
> >> > > > > > > > > > > > > > > > > > > - Work hasn't been completed since the last heartbeat (checking
> >> > > > > > > > > > > > > > > > > > > milestones)
> >> > > > > > > > > > > > > > > > > > > - Too many system resources used by a thread for a long period of
> >> > > > > > > > > > > > > > > > > > > time (allocated memory, CPU)
> >> > > > > > > > > > > > > > > > > > > - Timing fields associated with each thread status exceeded a
> >> > > > > > > > > > > > > > > > > > > maximum time limit.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Third (not too many options here),
> >> > > > > > > > > > > > > > > > > > > - `log everything` should be the default behaviour in all these
> >> > > > > > > > > > > > > > > > > > > cases, since it may be difficult to find the cause after the restart.
> >> > > > > > > > > > > > > > > > > > > - Wait some interval of time and kill the hanging node (cluster
> >> > > > > > > > > > > > > > > > > > > should be configured stable enough)
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Questions,
> >> > > > > > > > > > > > > > > > > > > - Not sure, but can workers miss their heartbeat deadlines if CPU
> >> > > > > > > > > > > > > > > > > > > loads up to 80%-90%? Bursts of momentary overloads can be
> >> > > > > > > > > > > > > > > > > > > expected behaviour as a normal part of system operations.
> >> > > > > > > > > > > > > > > > > > > - Why do we decide that critical threads should monitor each other?
> >> > > > > > > > > > > > > > > > > > > For instance, if all the tasks were blocked and unable to run,
> >> > > > > > > > > > > > > > > > > > > node reset would never occur. As for me, a better solution is to
> >> > > > > > > > > > > > > > > > > > > use a separate monitor thread or pool (maybe both with software
> >> > > > > > > > > > > > > > > > > > > and hardware checks) that not only checks heartbeats but monitors
> >> > > > > > > > > > > > > > > > > > > the other system as well.
> >> > > > > > > > > > > > > > > > > > >
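
The separate-monitor idea from the message above, combined with the earlier suggestion to expose a list of blocked workers, can be sketched as follows. All names here (`SketchWatchdog`, `heartbeat`, `blockedWorkers`) are hypothetical and this is not how Ignite's WorkersRegistry is implemented. Time is passed in explicitly so the logic stays deterministic; a real monitor would call `blockedWorkers` periodically from its own thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry polled by a dedicated monitor thread rather than by the workers themselves.
class SketchWatchdog {
    private final Map<String, Long> heartbeats = new ConcurrentHashMap<>();
    private final long timeoutMs;

    SketchWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called by a worker whenever it makes progress. */
    void heartbeat(String worker, long nowMs) {
        heartbeats.put(worker, nowMs);
    }

    /** Returns the names of workers whose last heartbeat is older than the timeout, sorted. */
    List<String> blockedWorkers(long nowMs) {
        List<String> blocked = new ArrayList<>();
        for (Map.Entry<String, Long> e : heartbeats.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs)
                blocked.add(e.getKey());
        }
        Collections.sort(blocked);
        return blocked;
    }
}
```

Because the monitor only reads timestamps, it keeps working even when every registered worker is stuck, which addresses the concern that workers monitoring each other could all block at once.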
> >> > > > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <[hidden email]> wrote:
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > It would be safer to restart the entire cluster than to remove the
> >> > > > > > > > > > > > > > > > > > > > last node for a cache that should be redundant.
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <[hidden email]> wrote:
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Hi,
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can provide some option that manages
> >> > > > > > > > > > > > > > > > > > > > > worker liveness checker behavior in case of observing that some
> >> > > > > > > > > > > > > > > > > > > > > worker is blocked too long.
> >> > > > > > > > > > > > > > > > > > > > > At least it will be some workaround for cases when node failure
> >> > > > > > > > > > > > > > > > > > > > > is too annoying.
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > Backups count threshold sounds good but I don't understand how it
> >> > > > > > > > > > > > > > > > > > > > > will help in case of cluster hanging.
> >> > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > The simplest solution here is an alert in cases of blocking of some
> >> > > > > > > > > > > > > > > > > > > > > critical worker (we can improve WorkersRegistry for this purpose and
> >> > > > > > > > > > > > > > > > > > > > > expose list of blocked workers) and optionally call the
> >> > > > > > > > > > > > > > > > > > > > > system-configured failure processor. BTW, the failure processor can
> >> > > > > > > > > > > > > > > > > > > > > be extended in order to perform any checks (e.g. backup count) and
> >> > > > > > > > > > > > > > > > > > > > > decide whether it should stop the node or not.
> >> > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <[hidden email]> wrote:
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness checks deal with
> >> > > > > > > > > > > > > > > > > > > > > > _critical_ conditions, i.e. when such a condition is met we consider
> >> > > > > > > > > > > > > > > > > > > > > > the node totally broken, and there is no sense in keeping it alive
> >> > > > > > > > > > > > > > > > > > > > > > regardless of the data it contains. If we want to give it a chance,
> >> > > > > > > > > > > > > > > > > > > > > > then the condition (long fsync etc.) should not be considered
> >> > > > > > > > > > > > > > > > > > > > > > critical at all.
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <[hidden email]> wrote:
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a backups
> >> > > > > > > > > > > > > > > > > > > > > > > count threshold (at runtime also!) that will not allow any automatic
> >> > > > > > > > > > > > > > > > > > > > > > > stop if there will be a data loss. Andrey, what do you think?
> >> > > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > > --Yakov
> >> > > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > > > > > Best regards,
> >> > > > > > > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > > Maxim Muzafarov
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > Best regards,
> >> > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > Best regards,
> >> > > > > > > > > > > > > > Andrey Kuznetsov.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > --
> >> > > > > > > > > > > > > Best Regards, Vyacheslav D.
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > --
> >> > > > > > > > > > > Best Regards, Vyacheslav D.
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > --
> >> > > > > > > > > > --
> >> > > > > > > > > > Maxim Muzafarov
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > --
> >> > --
> >> > Maxim Muzafarov
> >> >
> >>
> >>
> >> --
> >> Best regards,
> >> Andrey Kuznetsov.
> >>
> >
>

Dmitry Pavlov

Re: Critical worker threads liveness checking drawbacks

Hi,

Sorry for being too formal here, but IGNITE-10003
<https://issues.apache.org/jira/browse/IGNITE-10003> is in progress.

Also, I searched the list for anything related to it and found no request.
So, according to the list, no one asked to include it in 2.7.

Sincerely,
Dmitriy Pavlov
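
The `blockingSectionEnd` typo discussed in the quoted thread can be illustrated with a small self-contained model. This is only a sketch under assumptions: `WatchedWorker`, its explicit `now` arguments and the sentinel value are hypothetical names chosen for illustration, not Ignite's actual worker code.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified model of a worker watched by a liveness checker. */
class WatchedWorker {
    /** Sentinel heartbeat value: the worker is inside a blocking section. */
    private static final long IN_BLOCKING_SECTION = Long.MAX_VALUE;

    /** Timestamp (ms) of the last observed progress. */
    private final AtomicLong heartbeatTs;

    WatchedWorker(long now) {
        heartbeatTs = new AtomicLong(now);
    }

    /** Call right before an operation that may legitimately block for long. */
    void blockingSectionBegin() {
        heartbeatTs.set(IN_BLOCKING_SECTION);
    }

    /** Call in a finally block after the blocking operation returns. */
    void blockingSectionEnd(long now) {
        heartbeatTs.set(now);
    }

    /** Watchdog-side check: no progress for longer than timeoutMs. */
    boolean isBlocked(long now, long timeoutMs) {
        long ts = heartbeatTs.get();

        return ts != IN_BLOCKING_SECTION && now - ts > timeoutMs;
    }

    /** Correct guard: Begin is called before the wait starts. */
    static boolean blockedDuringGuardedWait(long timeoutMs, long waitMs) {
        WatchedWorker w = new WatchedWorker(0L);

        w.blockingSectionBegin();

        // The watchdog inspects the worker while it is still waiting.
        return w.isBlocked(waitMs, timeoutMs);
    }

    /** The typo: End is called where Begin was intended. */
    static boolean blockedDuringMistypedWait(long timeoutMs, long waitMs) {
        WatchedWorker w = new WatchedWorker(0L);

        w.blockingSectionEnd(0L); // Only refreshes the heartbeat.

        return w.isBlocked(waitMs, timeoutMs);
    }
}
```

In this model, a wait longer than the timeout is reported as a blocked worker only in the mistyped variant, because the heartbeat is merely refreshed instead of being suspended for the duration of the wait.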

On Wed, 19 Dec 2018 at 13:24, Nikolay Izhikov <[hidden email]> wrote:

> Hello, Alexey.
>
> No, we didn't include this ticket in 2.7.
> Should we?
>
> On Wed, 19 Dec 2018 at 12:55, Alexey Goncharuk <[hidden email]> wrote:
>
> > Folks, why didn't we include IGNITE-10003 in the ignite-2.7 release scope?
> > As a result, an Ignite node is stopped by default when checkpoint read
> > lock acquisition times out. I expect a lot of Ignite 2.7 users will be
> > affected by this mistake.
> >
> > We should at least update the documentation and make users aware of a
> > workaround.
> >
> > On Thu, 25 Oct 2018 at 16:35, Alexey Goncharuk <[hidden email]> wrote:
> > > Andrey,
> > >
> > > I still see that checkpoint read lock acquisition raises a
> > > CRITICAL_ERROR, which by default will shut down the local node. As far
> > > as I remember, we decided that by default a thread timeout should not
> > > trigger node failure. Now, however, it does, because we ignore only
> > > SYSTEM_WORKER_BLOCKED events in the default configuration.
> > >
> > > Should we introduce another critical failure type,
> > > CHECKPOINT_READ_LOCK_BLOCKED, or use SYSTEM_WORKER_BLOCKED for checkpoint
> > > read lock acquire failure?
> > >
> > > --AG
> > >
> > > On Fri, 12 Oct 2018 at 8:29, Andrey Kuznetsov <[hidden email]> wrote:
> > >
> > >> Igniters,
> > >>
> > >> Now I spot blocking / long-running code arising from
> > >> {{GridDhtPartitionsExchangeFuture#init}} calls in the partition-exchanger
> > >> thread, see [1]. Ideally, all blocking operations along all possible code
> > >> paths should be guarded implicitly from the critical failure detector to
> > >> prevent the thread from being considered blocked. There is a pull
> > >> request [2] that provides a shallow solution. I didn't change code outside
> > >> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
> > >> upcoming change. Also, I didn't touch the code runnable by threads other
> > >> than partition-exchanger. So I have a number of guarded sections that are
> > >> wider than they could be, and this potentially hides issues from the
> > >> failure detector. Does this PR make sense? Or maybe it's better to exclude
> > >> partition-exchanger from the critical threads registry altogether?
> > >>
> > >> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > >> [2] https://github.com/apache/ignite/pull/4962
> > >>
> > >>
> > >> On Fri, 28 Sep 2018 at 18:56, Maxim Muzafarov <[hidden email]> wrote:
> > >>
> > >> > Andrey, Andrey
> > >> >
> > >> > > Thanks for being attentive! It's definitely a typo. Could you please
> > >> > > create an issue?
> > >> >
> > >> > I've created an issue [1] and prepared PR [2].
> > >> > Please, review this change.
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
> > >> > [2] https://github.com/apache/ignite/pull/4862
> > >> >
> > >> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov <[hidden email]> wrote:
> > >> >
> > >> > > Config option + mbean access. Does that make sense?
> > >> > >
> > >> > > Yakov
> > >> > >
> > >> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov <[hidden email]> wrote:
> > >> > >
> > >> > > > Then it should be a config option.
> > >> > > >
> > >> > > > On Fri, 28 Sep 2018 at 13:15, Andrey Gura <[hidden email]> wrote:
> > >> > > >
> > >> > > > > Guys,
> > >> > > > >
> > >> > > > > why do we need both a config option and a system property? I
> > >> > > > > believe one way is enough.
> > >> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <[hidden email]>
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > Ticket created -
> > >> > > > > > https://issues.apache.org/jira/browse/IGNITE-9737
> > >> > > > > >
> > >> > > > > > Fix version is 2.7.
> > >> > > > > >
> > >> > > > > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote:
> > >> > > > > > > Nikolay, I agree, a user should be able to disable both the
> > >> > > > > > > thread liveness check and the checkpoint read lock timeout
> > >> > > > > > > check from config and a system property.
> > >> > > > > > >
> > >> > > > > > > On Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <[hidden email]> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hello, Igniters.
> > >> > > > > > > >
> > >> > > > > > > > I found that this feature can't be disabled from config.
> > >> > > > > > > > The only way to disable it is from a JMX bean.
> > >> > > > > > > >
> > >> > > > > > > > I think it is very dangerous: if we have some corner case or
> > >> > > > > > > > a bug in this watchdog, it can make Ignite unusable.
> > >> > > > > > > > I propose to make it possible to disable this feature both
> > >> > > > > > > > from config and from JVM options.
> > >> > > > > > > >
> > >> > > > > > > > What do you think?
> > >> > > > > > > >
> > >> > > > > > > > On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > >> > > > > > > > > Maxim,
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could
> > >> > > > > > > > > you please create an issue?
> > >> > > > > > > > >
> > >> > > > > > > > > On Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <[hidden email]> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Folks,
> > >> > > > > > > > > >
> > >> > > > > > > > > > I've found that at `GridCachePartitionExchangeManager:2684` [1]
> > >> > > > > > > > > > (master branch) the exchange future wait is wrapped with two
> > >> > > > > > > > > > `blockingSectionEnd` calls. Is that correct? I just want to
> > >> > > > > > > > > > understand this change and how I should use it in the future.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Should I file a new issue to fix this? I think the
> > >> > > > > > > > > > `blockingSectionBegin` method should be used here.
> > >> > > > > > > > > >
> > >> > > > > > > > > > -------------
> > >> > > > > > > > > > blockingSectionEnd();
> > >> > > > > > > > > >
> > >> > > > > > > > > > try {
> > >> > > > > > > > > >     resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > >> > > > > > > > > > } finally {
> > >> > > > > > > > > >     blockingSectionEnd();
> > >> > > > > > > > > > }
> > >> > > > > > > > > >
> > >> > > > > > > > > > [1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <[hidden email]> wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Andrey Gura, thank you for the answer!
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > I agree that wrapping the 'init' method reduces the benefit
> > >> > > > > > > > > > > of the watchdog service for the PME worker, but in other
> > >> > > > > > > > > > > cases we should wrap all potentially long sections in
> > >> > > > > > > > > > > GridDhtPartitionsExchangeFuture, for example the
> > >> > > > > > > > > > > 'onCacheChangeRequest' method or
> > >> > > > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside it, because
> > >> > > > > > > > > > > it may take significant time (reproducer attached).
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > I only want to point out a possible issue which may allow
> > >> > > > > > > > > > > an end user to halt the Ignite cluster accidentally.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > I'm sure that PME experts know how to fix this issue
> > >> > > > > > > > > > > properly.
> > >> > > > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <[hidden email]> wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Vyacheslav,
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > The exchange worker is strongly tied to
> > >> > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init, and that is OK. The
> > >> > > > > > > > > > > > exchange worker also shouldn't be blocked for a long time,
> > >> > > > > > > > > > > > but in reality it happens. It also means that your change
> > >> > > > > > > > > > > > doesn't make sense.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > What actually makes sense is identifying the places that
> > >> > > > > > > > > > > > block intentionally. Maybe some places/actions should be
> > >> > > > > > > > > > > > wrapped in blocking guards.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > If you have failing tests, please make sure that your
> > >> > > > > > > > > > > > failureHandler is NoOpFailureHandler or any other handler
> > >> > > > > > > > > > > > with ignoreFailureTypes = [CRITICAL_WORKER_BLOCKED].
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <[hidden email]> wrote:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Hi Igniters!
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Thank you for this important improvement!
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I've looked through the implementation and noticed that
> > >> > > > > > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped
> > >> > > > > > > > > > > > > in a blocking section. This means it is easy to halt the
> > >> > > > > > > > > > > > > node in case of long-running actions during PME, for
> > >> > > > > > > > > > > > > example when we create a cache with a StoreFactory that
> > >> > > > > > > > > > > > > connects to a 3rd-party DB.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I'm not sure that it is the right behavior.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I filed the issue [1] and prepared the PR [2] with a
> > >> > > > > > > > > > > > > reproducer and a possible fix.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Andrey, could you please take a look and confirm that it
> > >> > > > > > > > > > > > > makes sense?
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > >> > > > > > > > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > >> > > > > > > > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <[hidden email]> wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Denis,
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I've created the ticket [1] with short
> > >> description
> > >> > of
> > >> > > > the
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > functionality.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > [1]
> > >> > > https://issues.apache.org/jira/browse/IGNITE-9679
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda <[hidden email]> wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Andrey K. and G.,
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Thanks, do we have a documentation ticket
> > >> > created?
> > >> > > > > Prachi
> > >> > > > > > > > > >
> > >> > > > > > > > > > (copied)
> > >> > > > > > > > > > > can help
> > >> > > > > > > > > > > > > > > with the documentation.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > Denis
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey
> Gura
> > <
> > >> > > > > > > >
> > >> > > > > > > > [hidden email]>
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Andrey,
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > finally your change is merged to master
> > >> branch.
> > >> > > > > > > >
> > >> > > > > > > > Congratulations
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > and
> > >> > > > > > > > > > > > > > > > thank you very much! :)
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > I think that the next step is feature
> that
> > >> will
> > >> > > > allow
> > >> > > > > > > >
> > >> > > > > > > > signal
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > about
> > >> > > > > > > > > > > > > > > > blocked threads to the monitoring tools
> > via
> > >> > > MXBean.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > I hope you will continue development of
> > this
> > >> > > > feature
> > >> > > > > and
> > >> > > > > > > > > >
> > >> > > > > > > > > > provide
> > >> > > > > > > > > > > your
> > >> > > > > > > > > > > > > > > > vision in new JIRA issue.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey
> > >> > Kuznetsov
> > >> > > <
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > [hidden email]>
> > >> > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > David, Maxim!
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Thanks a lot for you ideas.
> > >> Unfortunately, I
> > >> > > > can't
> > >> > > > > adopt
> > >> > > > > > > >
> > >> > > > > > > > all
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > of them
> > >> > > > > > > > > > > > > > > > right
> > >> > > > > > > > > > > > > > > > > now: the scope is much broader than
> the
> > >> scope
> > >> > > of
> > >> > > > > the
> > >> > > > > > > >
> > >> > > > > > > > change I
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > implement.
> > >> > > > > > > > > > > > > > > > I
> > >> > > > > > > > > > > > > > > > > have had a talk to a group of Ignite
> > >> > commiters,
> > >> > > > > and we
> > >> > > > > > > >
> > >> > > > > > > > agreed
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > to
> > >> > > > > > > > > > > > > > > complete
> > >> > > > > > > > > > > > > > > > > the change as follows.
> > >> > > > > > > > > > > > > > > > > - Blocking instructions in
> > system-critical
> > >> > > which
> > >> > > > > may
> > >> > > > > > > > > >
> > >> > > > > > > > > > resonably
> > >> > > > > > > > > > > last
> > >> > > > > > > > > > > > > > > long
> > >> > > > > > > > > > > > > > > > > should be explicitly excluded from the
> > >> > > > monitoring.
> > >> > > > > > > > > > > > > > > > > - Failure handlers should have a
> setting
> > >> to
> > >> > > > > suppress some
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > failures on
> > >> > > > > > > > > > > > > > > > > per-failure-type basis.
> > >> > > > > > > > > > > > > > > > > According to this I have updated the
> > >> > > > > implementation: [1]
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > [1]
> > >> > https://github.com/apache/ignite/pull/4089
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey <[hidden email]> wrote:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > When I've done this before,I've
> needed
> > >> to
> > >> > > find
> > >> > > > > the
> > >> > > > > > > >
> > >> > > > > > > > oldest
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > thread,
> > >> > > > > > > > > > > > > > > and
> > >> > > > > > > > > > > > > > > > kill
> > >> > > > > > > > > > > > > > > > > > the node running that. From a
> > language
> > >> > > > > standpoint,
> > >> > > > > > > > > >
> > >> > > > > > > > > > Maxim's
> > >> > > > > > > > > > > "without
> > >> > > > > > > > > > > > > > > > > > progress" better than "heartbeat".
> > For
> > >> > > > > example, what
> > >> > > > > > > >
> > >> > > > > > > > I'm
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > most
> > >> > > > > > > > > > > > > > > > interested
> > >> > > > > > > > > > > > > > > > > > in on a distributed system is which
> > >> thread
> > >> > > > > started the
> > >> > > > > > > >
> > >> > > > > > > > work
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > it has
> > >> > > > > > > > > > > > > > > not
> > >> > > > > > > > > > > > > > > > > > completed the earliest, and when did
> > >> that
> > >> > > > thread
> > >> > > > > last
> > >> > > > > > > >
> > >> > > > > > > > make
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > forward
> > >> > > > > > > > > > > > > > > > > > process. You don't want to kill
> a
> > >> node
> > >> > > > > because a
> > >> > > > > > > >
> > >> > > > > > > > thread
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > is
> > >> > > > > > > > > > > > > > > waiting
> > >> > > > > > > > > > > > > > > > on a
> > >> > > > > > > > > > > > > > > > > > lock held by a thread that went
> > off-node
> > >> > and
> > >> > > > has
> > >> > > > > not
> > >> > > > > > > > > >
> > >> > > > > > > > > > gotten a
> > >> > > > > > > > > > > > > > > response.
> > >> > > > > > > > > > > > > > > > > > If you don't understand the
> dependency
> > >> > > > > relationships,
> > >> > > > > > > >
> > >> > > > > > > > you
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > will make
> > >> > > > > > > > > > > > > > > > > > incorrect recovery decisions.
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM
> Maxim
> > >> > > > Muzafarov <
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > [hidden email]>
> > >> > > > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > I think we should find exact
> answers
> > >> to
> > >> > > these
> > >> > > > > > > >
> > >> > > > > > > > questions:
> > >> > > > > > > > > > > > > > > > > > > 1. What `critical` issue exactly
> > is?
> > >> > > > > > > > > > > > > > > > > > > 2. How can we find critical
> issues?
> > >> > > > > > > > > > > > > > > > > > > 3. How can we handle critical
> > issues?
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > First,
> > >> > > > > > > > > > > > > > > > > > > - Ignore uninterruptable actions
> > >> (e.g.
> > >> > > > > > > >
> > >> > > > > > > > worker\service
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > shutdown)
> > >> > > > > > > > > > > > > > > > > > > - Long I/O operations (should be
> a
> > >> > > > > configurable
> > >> > > > > > > >
> > >> > > > > > > > timeout
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > for each
> > >> > > > > > > > > > > > > > > > type of
> > >> > > > > > > > > > > > > > > > > > > usage)
> > >> > > > > > > > > > > > > > > > > > > - Infinite loops
> > >> > > > > > > > > > > > > > > > > > > - Stalled\deadlocked threads
> > (and\or
> > >> too
> > >> > > > many
> > >> > > > > parked
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > threads,
> > >> > > > > > > > > > > > > > > > exclude
> > >> > > > > > > > > > > > > > > > > > I/O)
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Second,
> > >> > > > > > > > > > > > > > > > > > > - The working queue is without
> > >> progress
> > >> > > > (e.g.
> > >> > > > > disco,
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > exchange
> > >> > > > > > > > > > > > > > > > queues)
> > >> > > > > > > > > > > > > > > > > > > - Work hasn't been completed
> since
> > >> the
> > >> > > last
> > >> > > > > > > >
> > >> > > > > > > > heartbeat
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > (checking
> > >> > > > > > > > > > > > > > > > > > > milestones)
> > >> > > > > > > > > > > > > > > > > > > - Too many system resources used
> > by a
> > >> > > thread
> > >> > > > > for the
> > >> > > > > > > > > >
> > >> > > > > > > > > > long
> > >> > > > > > > > > > > period
> > >> > > > > > > > > > > > > > > of
> > >> > > > > > > > > > > > > > > > time
> > >> > > > > > > > > > > > > > > > > > > (allocated memory, CPU)
> > >> > > > > > > > > > > > > > > > > > > - Timing fields associated with
> > each
> > >> > > thread
> > >> > > > > status
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > exceeded a
> > >> > > > > > > > > > > > > > > > maximum
> > >> > > > > > > > > > > > > > > > > > time
> > >> > > > > > > > > > > > > > > > > > > limit.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Third (not too many options here),
> > >> > > > > > > > > > > > > > > > > > > - `log everything` should be the
> > >> default
> > >> > > > > behaviour
> > >> > > > > > > >
> > >> > > > > > > > in
> > >> > > > > > > > > >
> > >> > > > > > > > > > all
> > >> > > > > > > > > > > these
> > >> > > > > > > > > > > > > > > > cases,
> > >> > > > > > > > > > > > > > > > > > > since it may be difficult to find
> > the
> > >> > cause
> > >> > > > > after the
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > restart.
> > >> > > > > > > > > > > > > > > > > > > - Wait some interval of time and
> > kill
> > >> > the
> > >> > > > > hanging
> > >> > > > > > > >
> > >> > > > > > > > node
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > (cluster
> > >> > > > > > > > > > > > > > > > should
> > >> > > > > > > > > > > > > > > > > > be
> > >> > > > > > > > > > > > > > > > > > > configured stable enough)
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Questions,
> > >> > > > > > > > > > > > > > > > > > > - Not sure, but can workers miss
> > >> their
> > >> > > > > heartbeat
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > deadlines if CPU
> > >> > > > > > > > > > > > > > > > loads
> > >> > > > > > > > > > > > > > > > > > up
> > >> > > > > > > > > > > > > > > > > > > to 80%-90%? Bursts of momentary
> > >> overloads
> > >> > > can
> > >> > > > > be
> > >> > > > > > > > > > > > > > > > > > > expected behaviour as a normal
> > >> part
> > >> > of
> > >> > > > > system
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > operations.
> > >> > > > > > > > > > > > > > > > > > > - Why do we decide that critical
> > >> thread
> > >> > > > should
> > >> > > > > > > >
> > >> > > > > > > > monitor
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > each other?
> > >> > > > > > > > > > > > > > > > For
> > >> > > > > > > > > > > > > > > > > > > instance, if all the tasks were
> > >> blocked
> > >> > and
> > >> > > > > unable to
> > >> > > > > > > > > >
> > >> > > > > > > > > > run,
> > >> > > > > > > > > > > > > > > > > > > node reset would never occur.
> As
> > >> for
> > >> > > me,
> > >> > > > a
> > >> > > > > better
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > solution is
> > >> > > > > > > > > > > > > > > to
> > >> > > > > > > > > > > > > > > > use
> > >> > > > > > > > > > > > > > > > > > a
> > >> > > > > > > > > > > > > > > > > > > separate monitor thread or pool
> > (maybe
> > >> > both
> > >> > > > > with
> > >> > > > > > > >
> > >> > > > > > > > software
> > >> > > > > > > > > > > > > > > > > > > and hardware checks) that not
> > only
> > >> > > checks
> > >> > > > > > > >
> > >> > > > > > > > heartbeats
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > but
> > >> > > > > > > > > > > > > > > > monitors the
> > >> > > > > > > > > > > > > > > > > > > other system as well.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David
> > >> > Harvey <
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > [hidden email]>
> > >> > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > It would be safer to restart the
> > >> entire
> > >> > > > > cluster
> > >> > > > > > > >
> > >> > > > > > > > than to
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > remove
> > >> > > > > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > > last
> > >> > > > > > > > > > > > > > > > > > > > node for a cache that should be
> > >> > > redundant.
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM
> > Andrey
> > >> > Gura
> > >> > > <
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > [hidden email]>
> > >> > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > Hi,
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > I agree with Yakov that we can
> > >> > provide
> > >> > > > some
> > >> > > > > > > >
> > >> > > > > > > > option
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > that manage
> > >> > > > > > > > > > > > > > > > worker
> > >> > > > > > > > > > > > > > > > > > > > > liveness checker behavior in
> > case
> > >> of
> > >> > > > > observing
> > >> > > > > > > >
> > >> > > > > > > > that
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > some worker
> > >> > > > > > > > > > > > > > > > is
> > >> > > > > > > > > > > > > > > > > > > > > blocked too long.
> > >> > > > > > > > > > > > > > > > > > > > > At least it will some
> > workaround
> > >> for
> > >> > > > > cases when
> > >> > > > > > > >
> > >> > > > > > > > node
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > fails is
> > >> > > > > > > > > > > > > > > > too
> > >> > > > > > > > > > > > > > > > > > > > > annoying.
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > Backups count threshold sounds
> > >> good
> > >> > > but I
> > >> > > > > don't
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > understand how
> > >> > > > > > > > > > > > > > > it
> > >> > > > > > > > > > > > > > > > > > will
> > >> > > > > > > > > > > > > > > > > > > > > help in case of cluster
> hanging.
> > >> > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > The simplest solution here is
> > >> alert
> > >> > in
> > >> > > > > cases of
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > blocking of
> > >> > > > > > > > > > > > > > > some
> > >> > > > > > > > > > > > > > > > > > > > > critical worker (we can
> improve
> > >> > > > > WorkersRegistry
> > >> > > > > > > >
> > >> > > > > > > > for
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > this
> > >> > > > > > > > > > > > > > > purpose
> > >> > > > > > > > > > > > > > > > and
> > >> > > > > > > > > > > > > > > > > > > > > expose list of blocked
> workers)
> > >> and
> > >> > > > > optionally
> > >> > > > > > > >
> > >> > > > > > > > call
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > system
> > >> > > > > > > > > > > > > > > > configured
> > >> > > > > > > > > > > > > > > > > > > > > failure processor. BTW,
> failure
> > >> > > processor
> > >> > > > > can be
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > extended in
> > >> > > > > > > > > > > > > > > > order to
> > >> > > > > > > > > > > > > > > > > > > > > perform any checks (e.g.
> backup
> > >> > count)
> > >> > > > and
> > >> > > > > decide
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > whether it
> > >> > > > > > > > > > > > > > > > should
> > >> > > > > > > > > > > > > > > > > > > > > stop node or not.
> > >> > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM
> > >> Andrey
> > >> > > > > Kuznetsov <
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > [hidden email]>
> > >> > > > > > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness checks
> > >> > > > > > > > > > > > > > > > > > > > > > deal with _critical_ conditions, i.e. when such a condition
> > >> > > > > > > > > > > > > > > > > > > > > > is met we consider the node totally broken, and there is no
> > >> > > > > > > > > > > > > > > > > > > > > > sense in keeping it alive regardless of the data it contains.
> > >> > > > > > > > > > > > > > > > > > > > > > If we want to give it a chance, then the condition (long
> > >> > > > > > > > > > > > > > > > > > > > > > fsync etc.) should not be considered critical at all.
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <[hidden email]> wrote:
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to set a
> > >> > > > > > > > > > > > > > > > > > > > > > > backup count threshold (at runtime also!) that will not
> > >> > > > > > > > > > > > > > > > > > > > > > > allow any automatic stop if it would cause data loss.
> > >> > > > > > > > > > > > > > > > > > > > > > > Andrey, what do you think?
> > >> > > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > > --Yakov
> > >> > > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > > > > Best regards,
> > >> > > > > > > > > > > > > > > > > > > > > > Andrey Kuznetsov.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > Maxim Muzafarov
> > >> > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > Best Regards, Vyacheslav D.