Apache Ignite Developers - Legacy Mail Archive

Add emergency node closing handler to public Ignite API

Andrey Kuznetsov

Add emergency node closing handler to public Ignite API

Hi Igniters!

When a node detects a critical error (e.g. OOME, deadlock, etc.), it should
invoke a user-defined callback and then attempt to close itself
gracefully. To make this possible, we need to extend the Ignite
interface by adding something like Ignite.onEmergencyClose(SomeClosure).

First, I'd like to get your feedback on this potential change. Then we can
refine the structure of SomeClosure.

--
Best regards,
Andrey Kuznetsov.

Vladimir Ozerov

Re: Add emergency node closing handler to public Ignite API

I am not sure this makes sense. First, in the general case we cannot count
on being able to run Java code. E.g. during a very long GC pause all Java
threads are stuck and it is impossible to invoke anything. Second, some other
conditions may be unrecoverable, such as OOME, where there is no guarantee
that any operation will succeed. So this is not a graceful shutdown. We should
kill the node forcefully IMO.

On Tue, Nov 14, 2017 at 7:23 PM, Andrey Kuznetsov <[hidden email]> wrote:

> Hi Igniters!
>
> When some node detects critical error, e.g. OOME, deadlock, etc, it should
> invoke some user-defined callback and then attempt to close itself
> gracefully. In order to make this possible we need to enhance Ignite
> interface by adding something like Ignite.onEmergencyClose(SomeClosure).
>
> First, I'd like to get your feedback on this potential change. Then we can
> refine SomeClosure structure.
>
> --
> Best regards,
> Andrey Kuznetsov.
>

Anton Vinogradov

Re: Add emergency node closing handler to public Ignite API

Vova,

That's not about "kill -9" or OOM, that's about the case when a node detects
something and decides to stop itself (e.g. persistence errors,
IgniteOutOfMemoryException, ExchangeWorker died).
Sure, we can't handle OOM or 100% CPU utilization by GC that way, but we
can handle some logical problems.

Andrey,

I propose to refactor the method to ignite.onClose(SomeClosure<SomeReason>).
In this case the user will be able to register a callback on all graceful
stops and detect the reason.
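
For illustration only, here is a minimal sketch of what a reason-aware close callback could look like. Everything in it is hypothetical: StopReason, CloseHandlerRegistry, onClose and fireClose are made-up names that mirror the SomeClosure<SomeReason> idea above, and none of them is part of the Ignite API. The reason values are simply the cases named in this message.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical stop reasons, taken from the cases named in this thread.
enum StopReason {
    USER_REQUEST,
    PERSISTENCE_ERROR,
    IGNITE_OUT_OF_MEMORY,
    EXCHANGE_WORKER_DIED
}

// Hypothetical registry of close callbacks; NOT part of the Ignite API.
final class CloseHandlerRegistry {
    private final List<Consumer<StopReason>> handlers = new CopyOnWriteArrayList<>();

    // Registers a user callback that receives the reason for the stop.
    void onClose(Consumer<StopReason> handler) {
        handlers.add(handler);
    }

    // Invokes every callback with the given reason and returns how many
    // of them completed without throwing.
    int fireClose(StopReason reason) {
        int completed = 0;
        for (Consumer<StopReason> handler : handlers) {
            try {
                handler.accept(reason);
                completed++;
            } catch (RuntimeException e) {
                // A failing user callback must not prevent the node from stopping.
                System.err.println("Close handler failed: " + e);
            }
        }
        return completed;
    }
}

final class CloseHandlerDemo {
    public static void main(String[] args) {
        CloseHandlerRegistry registry = new CloseHandlerRegistry();
        registry.onClose(reason -> System.out.println("Node is stopping, reason: " + reason));
        registry.fireClose(StopReason.PERSISTENCE_ERROR);
    }
}
```

The sketch isolates each callback so that one failing handler cannot block the others or the shutdown itself; whether a real implementation should behave this way is a design decision that the thread leaves open.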

On Tue, Nov 14, 2017 at 7:29 PM, Vladimir Ozerov <[hidden email]>
wrote:

> I am not sure this makes sense. First, in general case we do not have
> access to Java. E.g. in case of very long GC pause all Java threads are
> stuck and it is impossible to invoke anything. Second, some other
> conditions may be unrecoverable, such as OOME, where there is no guarantee
> that any operation succeed. So this is not graceful shutdown. We should
> kill the node forcefully IMO.

Vladimir Ozerov

Re: Add emergency node closing handler to public Ignite API

Can you explain what kind of logic could be placed there? And why do we
need another configuration property and/or interface? We already have
LifecycleBean, where the Ignite instance can be injected, so the user is
already able to perform anything there.

On Tue, Nov 14, 2017 at 7:46 PM, Anton Vinogradov <[hidden email]>
wrote:

> Vova,
>
> That's not about "kill -9" or OOM, that's about case when node detected
> something and decided to stop itself (eg. persistence errors,
> IgniteOutOfMemoryException, ExchangeWorker died)
> Sure, we can't handle OOM or 100% CPU utilization by GC it that way, but we
> can handle some logical problems.
>
> Andrey,
>
> I propose to refactor method to ignite.onClose(SomeClosure<SomeReason>)
> In this case user will be able to register callback on all graceful stops,
> and detect it's reason.

Anton Vinogradov

Re: Add emergency node closing handler to public Ignite API

Vova,

We should give the user the ability to be notified when a node decides to
stop itself.
Only the user knows how they want to be notified, so we should provide the
ability to register a custom callback (e.g. send an SMS or call a REST service).
This will cover the cases when a node stops gracefully.

Please see Semen's comment at
https://issues.apache.org/jira/browse/IGNITE-5811 for details.

P.S. Cases where a node stops without being able to do anything should be
covered by an external watchdog.
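
As a rough sketch of the external-watchdog idea, the helper below checks whether a node process is still alive by PID. It assumes Java 9+ (java.lang.ProcessHandle) and is meant to run in a separate process from the node it watches. NodeWatchdog and its method names are hypothetical and not part of Ignite.

```java
import java.util.Optional;

// Hypothetical watchdog helper; not part of Ignite. Intended to run in a
// process other than the node it watches.
final class NodeWatchdog {
    // Returns true if a process with the given PID exists and is alive.
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    // Registers an action to run once the watched process terminates.
    // Returns false if no process with that PID exists (it is already gone).
    // Do not pass the PID of the current JVM: ProcessHandle.onExit() rejects
    // the current process with an IllegalStateException.
    static boolean notifyOnExit(long pid, Runnable action) {
        Optional<ProcessHandle> handle = ProcessHandle.of(pid);
        handle.ifPresent(h -> h.onExit().thenRun(action));
        return handle.isPresent();
    }
}
```

A watchdog built this way can only tell that the process has gone away, not why; the reason for a crash would still have to come from logs or other diagnostics.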

On Tue, Nov 14, 2017 at 20:08, Vladimir Ozerov <[hidden email]> wrote:

> Can you explain what kind of logic could be placed there? And why do we
> need another configuration property and/or interface? We already have
> LifecycleBean, where Ignite instance could be injected, so user is already
> able to perform anything there.

Valentin Kulichenko

Re: Add emergency node closing handler to public Ignite API

Anton,

I agree with Vova - we already have lifecycle bean. Why do we need anything
on top of that?

-Val

On Tue, Nov 14, 2017 at 10:05 AM, Anton Vinogradov <[hidden email]> wrote:

> Vova,
>
> We should provide user ability to be notified in case some node decided to
> stop itself.
> Only user know how he want to be notified, so we should provide ability to
> register custom callback(eg. send sms or call rest service)
> This will cover cases when node stops gracefuly.
>
> Please, see Semen's comment at
> https://issues.apache.org/jira/browse/IGNITE-5811 for details.
>
> P.s. Cases when node stops without ability to do something should be
> covered by external watchdog.

Andrey Kuznetsov

Re: Add emergency node closing handler to public Ignite API

Lifecycle beans are OK, but they do not accept any info on the reason in
case of an emergency node stop.

2017-11-14 21:16 GMT+03:00 Valentin Kulichenko <[hidden email]>:

> Anton,
>
> I agree with Vova - we already have lifecycle bean. Why do we need anything
> on top of that?
>
> -Val
>
>

Vladimir Ozerov

Re: Add emergency node closing handler to public Ignite API

You can get this info from injected Ignite instance.

On Tue, Nov 14, 2017 at 10:13 PM, Andrey Kuznetsov <[hidden email]>
wrote:

> Lifecycle beans are ok, but they does not accept any info on the Reason in
> case of emergency node stop.

Andrey Kuznetsov

Re: Add emergency node closing handler to public Ignite API

Vladimir, the Ignite instance won't tell me whether a deadlock occurred or
some critical thread has died.

On Nov 14, 2017, at 22:28, "Vladimir Ozerov" <[hidden email]> wrote:

> You can get this info from injected Ignite instance.

Valentin Kulichenko

Re: Add emergency node closing handler to public Ignite API

Andrey,

Then let's add an API to get this information. There is no need to add another
callback as we already have one.

-Val

On Tue, Nov 14, 2017 at 11:34 AM, Andrey Kuznetsov <[hidden email]>
wrote:

> Vladimir, Ignite instance won't tell me whether deadlock occurred or some
> critical thread has died.

Vladimir Ozerov

Re: Add emergency node closing handler to public Ignite API

This information should be available through local metrics, so that it is
accessible from the Ignite instance.

On Tue, Nov 14, 2017 at 22:37, Valentin Kulichenko <[hidden email]> wrote:

> Andrey,
>
> Then let's add API to get this information. There is no need to add another
> callback as we already have one.
>
> -Val

Andrey Kuznetsov

Re: Add emergency node closing handler to public Ignite API

Vladimir,

Could you please clarify what local metrics are? Should I extend the Ignite
interface by adding something similar to dataRegionMetrics(), or is there
some universal mechanism for handling metrics?

2017-11-15 8:30 GMT+03:00 Vladimir Ozerov <[hidden email]>:
>
> This information should be available through local metrics, so that it is
> accessible from Ignite instance.
>

Anton Vinogradov

Re: Add emergency node closing handler to public Ignite API

Vova,

Could you point to the metric you're talking about?

On Wed, Nov 15, 2017 at 1:06 PM, Andrey Kuznetsov <[hidden email]> wrote:

> Vladimir,
>
> Could you please refine, what are local metrics? Should I extend Ignite
> interface by adding something similar to dataRegionMetrics() or there is
> some universal mechanism to handle metrics?

Vladimir Ozerov

Re: Add emergency node closing handler to public Ignite API

AFAIK the idea was not only to shut down the node, but also to give the user
(e.g. an administrator) the ability to observe the problem from the outside,
e.g. through JMX. E.g. if we detect a Java-level deadlock, it doesn't mean that
the only possible solution is node shutdown. The reaction could also be a
no-op, e.g. to give the user a chance to collect additional system info, or
simply because this particular deadlock is resolvable (e.g.
Lock.lockInterruptibly()). So, as we need to expose health info through JMX
anyway, we could also give the user programmatic access to it.
Alternatively, we can expose this info through JMX only and ask the user to get
an instance of that bean manually.
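
To make the Java-level deadlock case concrete, here is a small sketch that uses the JDK's standard ThreadMXBean, the same platform bean a JMX client would see. DeadlockProbe is a hypothetical helper name and not an Ignite class. The sketch only reports the deadlocked threads and leaves the reaction (no-op, thread dump, shutdown) to the caller.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Ignite.
final class DeadlockProbe {
    // Returns the names of threads that the JVM currently reports as
    // deadlocked, or an empty list if no deadlock is detected.
    static List<String> deadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // findDeadlockedThreads() also covers java.util.concurrent locks,
        // but needs synchronizer usage monitoring; otherwise fall back to
        // monitor-only detection. Both return null when nothing is found.
        long[] ids = mx.isSynchronizerUsageSupported()
            ? mx.findDeadlockedThreads()
            : mx.findMonitorDeadlockedThreads();

        List<String> names = new ArrayList<>();
        if (ids == null)
            return names;

        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            // An entry is null if the thread has terminated in the meantime.
            if (info != null)
                names.add(info.getThreadName());
        }
        return names;
    }
}
```

A thread blocked in Lock.lockInterruptibly() can still be released by interrupting it, which is one reason a detected deadlock does not have to end in a node shutdown.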

On Wed, Nov 15, 2017 at 1:19 PM, Anton Vinogradov <[hidden email]>
wrote:

> Vova,
>
> Could you point to metric you're talking about?

Anton Vinogradov

Re: Add emergency node closing handler to public Ignite API

Vova,

Currently we have a lot of IEPs to improve grid monitoring and behavior.

Let's split the tasks into:

1) Graceful shutdown.
In this case we'd like to give the user the ability to do something.
LifecycleBean is what we are looking for, thanks for the tip!
But we have to keep the shutdown reason somewhere.
If you know where it is already kept, please let us know.

2) OOM or any other reason that causes a node crash.
In this case some watchdog (like [1] or [2]) should monitor that the node is
alive.

3) GC and deadlock (Java and tx) issues.
These should be monitored by a special thread [3] or published by metrics [4].

4) Throughput, latency and space issues.
Special metrics should be developed according to [5].

Andrey is asking about case #1 (graceful shutdown), so let's discuss only this
case.

[1] https://issues.apache.org/jira/browse/IGNITE-6587
[2] https://wrapper.tanukisoftware.com/doc/english/download.jsp
[3] https://issues.apache.org/jira/browse/IGNITE-6171
[4]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
[5]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-6%3A+Metrics+improvements

On Wed, Nov 15, 2017 at 1:34 PM, Vladimir Ozerov <[hidden email]>
wrote:

> AFAIK the idea was not only to shutdown the node, but also to give user
> (e.g. administrator) ability to observe the problem from the outside, e.g.
> through JMX. E.g. if we detect Java-level deadlock, it doesn't mean that
> the only possible solution is node shutdown. In addition it could be no-op,
> e.g. to give user chance to collect additional system info, or simply
> because this particular deadlock is resolvable (e.g.
> Lock.lockInterruptibly()). So as we need to expose health info through JMX
> anyway, we could also give user programmatic access to it as well.
> Alternatively, we can expose this info through JMX only and ask user to get
> instance of that bean manually.

Vladimir Ozerov

Re: Add emergency node closing handler to public Ignite API

I am not quite sure I understand how the tasks are split. How can we discuss
graceful shutdown without discussing the reasons for this shutdown? What
leads to it?

On Wed, Nov 15, 2017 at 2:10 PM, Anton Vinogradov <[hidden email]>
wrote:

> Vova,
>
> Currently we have a lot IEPs to improve grid monitoring and behavior.
>
> Let's split tasks to:
>
> 1) Graceful shutdown.
> In this case we'd like to provide user ability to do something,
> LifecycleBean is what we looking for, thanks for tips!
> But, we have to keep shutdown reason somewhere.
> In case you know where it already kept , please let us know.
>
> 2) OOM or any other reason cause node crash.
> In this case some watchdog (like [1] or [2]) should monitor node alive
>
> 3) GC and deadlock(java and tx) issues
> Should be monitored by special thread [3] or published by metrics [4]
>
> 4) Throughput, latency and space issues
> Special metrics should be developed according to [5]
>
> Andrey asking about case #1 (graceful shutdown), lets discuss only this
> case.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-6587
> [2] https://wrapper.tanukisoftware.com/doc/english/download.jsp
> [3] https://issues.apache.org/jira/browse/IGNITE-6171
> [4]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> [5]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-6%3A+Metrics+improvements

Anton Vinogradov

Re: Add emergency node closing handler to public Ignite API

According to [1]

Reasons are:
- IgniteOutOfMemoryException
- Persistence errors
- ExchangeWorker exits with error

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection

On Wed, Nov 15, 2017 at 2:24 PM, Vladimir Ozerov <[hidden email]>
wrote:

> I am not quite I understand how tasks are split. How can we discuss
> graceful shutdown without discussing the reasons of this shutdown? What
> leads to it?

Vladimir Ozerov

Re: Add emergency node closing handler to public Ignite API

It would be nice to see the whole design first before going into low-level
details. Without it we are jumping from topic to topic. Were the list of
events and the reactions to these events discussed previously? At this point it
is not clear why nodes should be forcefully stopped without any
alternative.

For example, consider the following cases:
1) The exchange thread died. This is a critical situation. But as a part of
the analysis the administrator might want to dump threads before killing the
node. He can do that programmatically, which is difficult and requires
knowledge of Java, or through management utilities, such as jstack or
VisualVM. Which is more user friendly?
2) We start a service with multiple data regions. One data region is
configured incorrectly, which causes IOOME on multiple nodes. Why do you
think that the whole cluster (or many nodes) should be restarted? This means
potential data loss in all caches (not only in the affected ones) and
interruption of service. Instead, the administrator might decide to gradually
reconfigure and restart nodes one by one, rather than killing them all
immediately.

This is why we need the design first.
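
For reference, the programmatic route mentioned in case 1 could look roughly like the sketch below, which uses only the JDK's ThreadMXBean. ThreadDumper is a hypothetical helper name, and the output format is an assumption that only loosely resembles what jstack prints.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; not part of Ignite.
final class ThreadDumper {
    // Builds a plain-text dump of all live threads with full stack traces.
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();

        // Request lock details only where the JVM supports collecting them.
        ThreadInfo[] infos = mx.dumpAllThreads(
            mx.isObjectMonitorUsageSupported(),
            mx.isSynchronizerUsageSupported());

        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');

            // Frames are printed manually because ThreadInfo.toString()
            // limits the number of frames it shows.
            for (StackTraceElement frame : info.getStackTrace())
                sb.append("\tat ").append(frame).append('\n');

            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Calling ThreadDumper.dump() from a callback still requires the JVM to be responsive, whereas jstack attaches from the outside, which is part of the trade-off raised above.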

On Wed, Nov 15, 2017 at 2:39 PM, Anton Vinogradov <[hidden email]>
wrote:

> According to [1]
>
> Reasons are:
> - IgniteOutOfMemoryException
> - Persistence errors
> - ExchangeWorker exits with error
>
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection

Anton Vinogradov

Re: Add emergency node closing handler to public Ignite API

Vova,

I'll refactor IEP-7 [1], most likely merging it with IEP-5 [2], and let you
know when the overall design is ready and clear :)

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection#IEP-7:Igniteinternalproblemsdetection-SystemThreadRegestry
.
[2]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-5+Cluster+reaction+if+node+detects+an+extraordinary+situations

On Wed, Nov 15, 2017 at 3:21 PM, Vladimir Ozerov <[hidden email]>
wrote:

> It would be nice to see the whole design first before going into low-level
> details. Without it we are jumping from topic to topic. Were the list
> events and reaction to these events discussed previously? At this point it
> is not clear why nodes should be forcefully stopped without any
> alternative.
>
> For example, consider the following cases:
> 1) Exchange thread died. This is critical situation. But as a part of
> analysis administrator might want to dump threads before killing the node.
> He can do that programmatically, which is difficult and require knowledge
> of Java, or can do that through management utilities, such as jstack or
> VisualVM. What is more user friendly?
> 2) We start a service with multiple data regions. One data region is
> configured incorrectly, what causes IOOME on multiple nodes. Why do you
> think that the whole cluster (or many nodes) should be restarted? This is
> potential data loss in all caches (not only in affected) and interruption
> of service. Instead, administrator might decide to gradually reconfigure
> and restart nodes one by one, instead of killing them all immediately.
>
> This is why we need the design first.
>
> On Wed, Nov 15, 2017 at 2:39 PM, Anton Vinogradov <[hidden email]> wrote:
>
> > According to [1]
> >
> > Reasons are:
> > - IgniteOutOfMemoryException
> > - Persistence errors
> > - ExchangeWorker exits with error
> >
> > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
> >
> > On Wed, Nov 15, 2017 at 2:24 PM, Vladimir Ozerov <[hidden email]>
> > wrote:
> >
> > > > I am not sure I understand how the tasks are split. How can we discuss
> > > > graceful shutdown without discussing the reasons for this shutdown? What
> > > > leads to it?
> > >
> > > > On Wed, Nov 15, 2017 at 2:10 PM, Anton Vinogradov <[hidden email]> wrote:
> > >
> > > > Vova,
> > > >
> > > > > Currently we have a lot of IEPs to improve grid monitoring and behavior.
> > > > >
> > > > > Let's split the tasks into:
> > > > >
> > > > > 1) Graceful shutdown.
> > > > > In this case we'd like to give the user the ability to do something.
> > > > > LifecycleBean is what we are looking for, thanks for the tip!
> > > > > But we have to keep the shutdown reason somewhere.
> > > > > If you know where it is already kept, please let us know.
> > > > >
> > > > > 2) OOM or any other reason that causes a node crash.
> > > > > In this case some watchdog (like [1] or [2]) should monitor that the
> > > > > node is alive.
> > > > >
> > > > > 3) GC and deadlock (Java and tx) issues.
> > > > > These should be monitored by a special thread [3] or published via
> > > > > metrics [4].
> > > > >
> > > > > 4) Throughput, latency and space issues.
> > > > > Special metrics should be developed according to [5].
> > > > >
> > > > > Andrey is asking about case #1 (graceful shutdown), so let's discuss
> > > > > only this case.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-6587
> > > > [2] https://wrapper.tanukisoftware.com/doc/english/download.jsp
> > > > [3] https://issues.apache.org/jira/browse/IGNITE-6171
> > > > [4]
> > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > 7%3A+Ignite+internal+problems+detection
> > > > [5]
> > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > 6%3A+Metrics+improvements
> > > >
> > > >
> > > > > On Wed, Nov 15, 2017 at 1:34 PM, Vladimir Ozerov <[hidden email]> wrote:
> > > >
> > > > > > AFAIK the idea was not only to shut down the node, but also to give the
> > > > > > user (e.g. an administrator) the ability to observe the problem from the
> > > > > > outside, e.g. through JMX. If we detect a Java-level deadlock, it doesn't
> > > > > > mean that the only possible solution is node shutdown. The reaction could
> > > > > > also be a no-op, e.g. to give the user a chance to collect additional
> > > > > > system info, or simply because this particular deadlock is resolvable
> > > > > > (e.g. Lock.lockInterruptibly()). Since we need to expose health info
> > > > > > through JMX anyway, we could also give the user programmatic access to it.
> > > > > > Alternatively, we can expose this info through JMX only and ask the user
> > > > > > to get an instance of that bean manually.
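
To illustrate the no-op reaction described above: detecting a Java-level
deadlock and reporting it can be kept separate from the decision to stop the
node. A minimal sketch using only the JDK's ThreadMXBean follows;
DeadlockProbe and deadlockedThreadNames() are hypothetical names and are not
part of Ignite.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical probe, not part of Ignite: reports deadlocked threads without stopping anything.
final class DeadlockProbe {
    /** Returns the names of threads currently in a Java-level deadlock; empty if there are none. */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // findDeadlockedThreads() also covers java.util.concurrent locks, but is
        // only usable when the JVM supports monitoring of ownable synchronizers.
        long[] ids = mx.isSynchronizerUsageSupported()
            ? mx.findDeadlockedThreads()
            : mx.findMonitorDeadlockedThreads();

        List<String> names = new ArrayList<>();
        if (ids == null)
            return names; // Both methods return null when no deadlock is found.

        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) // A thread may have terminated in the meantime.
                names.add(info.getThreadName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + deadlockedThreadNames());
    }
}
```

The caller is free to log the result, expose it as a metric, or act on it.
Nothing in the probe itself forces a shutdown.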
> > > > >
> > > > > > On Wed, Nov 15, 2017 at 1:19 PM, Anton Vinogradov <[hidden email]> wrote:
> > > > >
> > > > > > Vova,
> > > > > >
> > > > > > > Could you point to the metric you're talking about?
> > > > > >
> > > > > > > On Wed, Nov 15, 2017 at 1:06 PM, Andrey Kuznetsov <[hidden email]> wrote:
> > > > > >
> > > > > > > Vladimir,
> > > > > > >
> > > > > > > > Could you please clarify what local metrics are? Should I extend the
> > > > > > > > Ignite interface by adding something similar to dataRegionMetrics(), or
> > > > > > > > is there some universal mechanism to handle metrics?
> > > > > > >
> > > > > > > 2017-11-15 8:30 GMT+03:00 Vladimir Ozerov <
> [hidden email]
> > >:
> > > > > > > >
> > > > > > > > This information should be available through local metrics,
> so
> > > that
> > > > > it
> > > > > > is
> > > > > > > > accessible from Ignite instance.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Tom Diederich

Meetup tonight in Santa Clara CA: "In-Memory Computing Essentials for Java Developers"

In reply to this post by Anton Vinogradov

Igniters, those of you in the San Francisco Bay Area are invited to tonight's meetup in Santa Clara. It features Ignite PMC Chair Denis Magda, who will deliver a 90-minute hands-on workshop titled "In-Memory Computing Essentials for Java Developers" <http://bit.ly/2ikS0ts>.

More info and RSVP here: <http://bit.ly/2ikS0ts>
Tom