Facility to detect long STW pauses and other system response degradations


Dmitriy_Sorokin
Hi, Igniters!

This discussion thread is related to
https://issues.apache.org/jira/browse/IGNITE-6171.

Currently there are no JVM performance monitoring tools in Apache Ignite, for
example nothing that measures the impact of GC (e.g. STW pauses) on the
operation of a node. I think we should add this functionality.

1) It is useful to know when the STW duration increases, or when any other
situation leads to similar consequences.
This will allow system administrators to address issues before they become
problems.

I propose to add a special thread that will record the current time every N
milliseconds and check the difference with the latest recorded value.
The maximum and total pause values for a certain period can be published as
special metrics available through JMX.
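
A minimal sketch of such a watchdog thread (class, field, and metric names are
illustrative, not an actual implementation):

import java.util.concurrent.atomic.AtomicLong;

/**
 * Wakes up every CHECK_INTERVAL ms and compares the actual gap between
 * wake-ups with the expected interval; the difference is treated as a
 * possible STW (or similar) pause. Max/total values could then be exposed
 * through a JMX MBean.
 */
public class PauseWatchdog implements Runnable {
    private static final long CHECK_INTERVAL = 10; // "N" milliseconds.

    private final AtomicLong maxPauseMs = new AtomicLong();
    private final AtomicLong totalPauseMs = new AtomicLong();

    @Override public void run() {
        long last = System.currentTimeMillis();

        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(CHECK_INTERVAL);
            }
            catch (InterruptedException ignored) {
                return;
            }

            long now = System.currentTimeMillis();
            long pause = now - last - CHECK_INTERVAL;

            if (pause > 0) {
                maxPauseMs.accumulateAndGet(pause, Math::max);
                totalPauseMs.addAndGet(pause);
            }

            last = now;
        }
    }

    /** Longest observed pause, ms. */
    public long maxPauseMs() { return maxPauseMs.get(); }

    /** Sum of all observed pauses, ms. */
    public long totalPauseMs() { return totalPauseMs.get(); }
}

Started as a daemon thread, this gives the maximum/total values that a JMX
MBean could publish.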

2) If a pause reaches a critical value, we need to stop the node without
waiting for the pause to end.

The thread (from the first part of the proposed solution) is able to
estimate the pause duration, but only after the pause completes.
So, we need an external thread (in another JVM, or a native one) that is able
to recognize that the pause duration has passed the critical mark.

We can estimate the (STW or similar) pause duration by
 a) reading the value updated by the first thread, e.g. via JMX, shared
memory, or a shared file (a sketch follows below),
 or
 b) using JVM diagnostic tools. Does anybody know cross-platform
solutions?
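
For option (a), a minimal sketch of an external watchdog running in a separate
JVM that reads a heartbeat timestamp from a shared file; the file path,
threshold, and the Unix 'kill' call are illustrative assumptions:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * The monitored node is assumed to overwrite its last heartbeat timestamp
 * (epoch millis) in a shared file. If the value becomes older than
 * MAX_PAUSE_MS, the node is considered stuck in a long pause and is killed.
 */
public class ExternalWatchdog {
    private static final long MAX_PAUSE_MS = 10_000;

    public static void main(String[] args) throws Exception {
        Path heartbeatFile = Paths.get(args[0]); // E.g. /tmp/ignite-heartbeat.
        String pid = args[1];                    // PID of the monitored node.

        while (true) {
            Thread.sleep(1_000);

            long lastBeat = Long.parseLong(
                new String(Files.readAllBytes(heartbeatFile)).trim());

            if (System.currentTimeMillis() - lastBeat > MAX_PAUSE_MS)
                Runtime.getRuntime().exec(new String[] {"kill", "-9", pid});
        }
    }
}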

Feel free to suggest ideas or tips, especially about the second part of the
proposal.

Thoughts?




Re: Facility to detect long STW pauses and other system response degradations

Anton Vinogradov
Igniters,

Since no one rejected the proposal, let's start with part one.

> I propose to add a special thread that will record the current time every N
> milliseconds and check the difference with the latest recorded value.
> The maximum and total pause values for a certain period can be published as
> special metrics available through JMX.


Re: Facility to detect long STW pauses and other system response degradations

Dmitriy Pavlov
A sleeping Java thread IMO is not an option, because the thread can stay in
TIMED_WAITING longer than the timeout.

Did I understand correctly that a native thread is proposed? And our goal
now is to select the best framework for this?

Can we limit this capability to several popular OSes (Windows, Linux) and not
implement this feature for all operating systems?



Re: Facility to detect long STW pauses and other system response degradations

Anton Vinogradov
Dmitriy,

> A sleeping Java thread IMO is not an option, because the thread can stay in
> TIMED_WAITING longer than the timeout.

That's the only idea we have, and, according to tests, it works!

> Did I understand correctly that a native thread is proposed? And our goal
> now is to select the best framework for this?

That's one of the possible options.
We can replace the native thread with another JVM; this should solve the
compatibility issues.



Re: Facility to detect long STW pauses and other system response degradations

Dmitriy Pavlov
Yes, we need some timestamp from Java code. But I think a JVM thread could
update the timestamp with delays not related to GC, and that would have the
same effect as IgniteUtils#currentTimeMillis().

Does this new test compare the result of the Java timestamp difference with
the GC logs?


Re: Facility to detect long STW pauses and other system response degradations

yzhdanov
Guys, how about having two native threads: one calling some Java method,
another monitoring that the first one is active and is not stuck on a
safepoint (which points to a GC pause)? Testing a GC pause with a Java thread
is a bit strange and can give info only after the GC pause finishes. Native
threads can give info immediately. This was suggested by V. Ozerov some time
ago. Vova, I believe we already have a ticket. Can you please provide a link?


--Yakov

Re: Facility to detect long STW pauses and other system response degradations

Dmitriy_Sorokin
I have tested the solution with a Java thread, and the GC logs contain the
same pause values (time for which threads were stopped) as those detected by
the Java thread.


My log (contains pauses > 100ms):
[2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
pause: 507 milliseconds.
[2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
pause: 5595 milliseconds.
[2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
pause: 3262 milliseconds.
[2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
pause: 1737 milliseconds.

GC log:
gridgain@dell-5580-92zc8h2:~$ cat
./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
threads were stopped: 0,0000845 seconds, Stopping threads took: 0,0000246
seconds
2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
threads were stopped: 0,0001072 seconds, Stopping threads took: 0,0000252
seconds
2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
threads were stopped: 0,5001082 seconds, Stopping threads took: 0,0000178
seconds    // GOT!
2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
threads were stopped: 5,5856603 seconds, Stopping threads took: 0,0000229
seconds    // GOT!
2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
threads were stopped: 3,2595700 seconds, Stopping threads took: 0,0000223
seconds    // GOT!
2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
threads were stopped: 1,7337123 seconds, Stopping threads took: 0,0000121
seconds    // GOT!
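
For reference, the "Total time for which application threads were stopped"
entries above are most likely produced by JVM options along these lines (an
assumption based on the JDK 8 HotSpot log format, not taken from the actual
test setup):

-Xloggc:./dev/ignite-logs/gc-<start-time>.log
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime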





Re: Facility to detect long STW pauses and other system response degradations

Anton Vinogradov
Yakov,

Issue is https://issues.apache.org/jira/browse/IGNITE-6171

We split the issue into:
#1 STW duration metrics
#2 External monitoring that allows stopping a node during STW

> Testing a GC pause with a Java thread is
> a bit strange and can give info only after the GC pause finishes.

That's OK, since it's #1.


Re: Facility to detect long STW pauses and other system response degradations

dmagda
My 2 cents.

1. Totally for a separate native process that will handle the monitoring of an Ignite process. The watchdog process can simply start a JVM tool like jstat and parse its GC logs: https://dzone.com/articles/how-monitor-java-garbage

2. As for the STW handling, I would make the possible reaction more generic. Let’s define a policy (enumeration) that defines how to deal with an unstable node. The possible actions might be as follows: kill the node, restart the node, or trigger a custom script using Runtime.exec or other methods.

What do you think? Specifically on point 2.


Denis
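
A rough sketch of what such a policy enumeration from point 2 might look like
(the enum name, exit codes, and script handling are illustrative assumptions,
not an actual Ignite API):

import java.io.IOException;

/** Possible reactions to a node that looks unstable (e.g. stuck in a long STW pause). */
public enum UnstableNodeAction {
    KILL,
    RESTART,
    CUSTOM_SCRIPT;

    /** Applies the chosen reaction; the script path is used only by CUSTOM_SCRIPT. */
    public void apply(String customScriptPath) throws IOException {
        switch (this) {
            case KILL:
                // Hard-stop the JVM; no shutdown hooks are run.
                Runtime.getRuntime().halt(130);

                break;

            case RESTART:
                // Exit code a wrapper script could interpret as "restart the node".
                Runtime.getRuntime().halt(250);

                break;

            case CUSTOM_SCRIPT:
                Runtime.getRuntime().exec(new String[] {customScriptPath});

                break;
        }
    }
}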


Re: Facility to detect long STW pauses and other system response degradations

Anton Vinogradov
Denis,

> 1. Totally for a separate native process that will handle the monitoring
> of an Ignite process. The watchdog process can simply start a JVM tool like
> jstat and parse its GC logs: https://dzone.com/articles/how-monitor-java-garbage
Different GCs, and even the same GC on different OS/JVM combinations, produce
different logs. It's not easy to parse them. But since http://gceasy.io can do
that, it looks to be possible somehow :) .
Do you know any libraries or solutions that allow doing this in real time?

> 2. As for the STW handling, I would make the possible reaction more
> generic. Let’s define a policy (enumeration) that defines how to deal with
> an unstable node. The possible actions might be as follows: kill the node,
> restart the node, or trigger a custom script using Runtime.exec or other methods.
Yes, it should be similar to the segmentation policy, plus custom script execution.



Re: Facility to detect long STW pauses and other system response degradations

Dmitriy_Sorokin
Don't forget that high CPU utilization can occur for reasons other than GC
STW, and GC log parsing will not help us in that case.



Re: Facility to detect long STW pauses and other system response degradations

Vladimir Ozerov
Honestly, I do not understand why we need a separate process to monitor the
node's state. Is it possible to list the advantages of this approach compared
to in-process monitoring threads?


Re: Facility to detect long STW pauses and other system response degradations

Andrey Kornev
My 2 cents. Don’t do it. JVM provides sufficient means of detecting a struggling process out of the box. SRE/Operations teams usually know how to monitor JVMs and can handle killing of such processes themselves.

The feature adds no value, just complexity (and more configuration parameters (!) — as if Ignite didn’t have enough of them already).

Regards,
Andrey

Re: Facility to detect long STW pauses and other system response degradations

Anton Vinogradov
Andrey,

>  JVM provides sufficient means of detecting a struggling process out of
> the box.

Could you point to some articles describing how to detect an STW pause
exceeding some duration using only the JVM API?


Re: Facility to detect long STW pauses and other system response degradations

Vladimir Ozerov
Anton,

The question is why a user may need such a precise measurement. I share
Andrey’s opinion - I cannot understand the value.


Re: Facility to detect long STW pauses and other system response degradations

Anton Vinogradov
Vova,

1) We can get collection info from GarbageCollectorMXBean, but it provides
only:
- collectionCount
- collectionTime

These are very interesting metrics, but they tell us nothing about long STW
pauses. A long STW pause means we have huge latency during the pause, and we
should find the reason and solve it.
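
For reference, a minimal sketch of reading those two values through the
standard management API:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcMxBeanExample {
    public static void main(String[] args) {
        // One bean per collector (e.g. "PS Scavenge", "PS MarkSweep", "G1 Young Generation").
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are cumulative since JVM start; there is no per-pause
            // duration here, which is why they say nothing about individual
            // long STW pauses.
            System.out.println(gc.getName()
                + ": collectionCount=" + gc.getCollectionCount()
                + ", collectionTime=" + gc.getCollectionTime() + " ms");
        }
    }
}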

2) So, we're working on new incremental metrics:
- total number of STW pauses longer than XXX
- total duration of STW pauses longer than XXX

which show us the JVM/GC health situation.

Does that answer your question?


Re: Facility to detect long STW pauses and other system response degradations

Vladimir Ozerov
The question is how an administrator should interpret these numbers. OK, I
opened a JMX console and see that there were 10 long GC events, which took 100
seconds.
1) When did these events occur? Over the last day, which is more or less OK,
or over the last 10 minutes, so my server is nearly dead?
2) How is the duration distributed? Were there 10 pauses of 10 seconds each,
or 9 short pauses of 1 second and one critical pause of 90 seconds?

Maybe a kind of sliding window plus min/max values would do a better job.
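
A rough sketch of such a sliding-window view (the window size and the data
structure are illustrative assumptions):

import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Keeps pause records for the last WINDOW_MS milliseconds, so an
 * administrator can see how many long pauses happened recently and how
 * their durations are distributed (e.g. the maximum over the window).
 */
public class SlidingPauseWindow {
    private static final long WINDOW_MS = 10 * 60 * 1000; // Last 10 minutes.

    /** [timestampMs, pauseDurationMs] pairs, oldest first. */
    private final Deque<long[]> pauses = new ArrayDeque<>();

    public synchronized void record(long timestampMs, long pauseMs) {
        pauses.addLast(new long[] {timestampMs, pauseMs});

        evictOld(timestampMs);
    }

    public synchronized int pauseCount(long nowMs) {
        evictOld(nowMs);

        return pauses.size();
    }

    public synchronized long maxPauseMs(long nowMs) {
        evictOld(nowMs);

        long max = 0;

        for (long[] p : pauses)
            max = Math.max(max, p[1]);

        return max;
    }

    /** Drops records that fell out of the window. */
    private void evictOld(long nowMs) {
        while (!pauses.isEmpty() && pauses.peekFirst()[0] < nowMs - WINDOW_MS)
            pauses.pollFirst();
    }
}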

On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <[hidden email]>
wrote:

> Vova,
>
> 1) We can gain collections info from GarbageCollectorMXBean
> But it provides only
> - collectionCount
> - collectionTime.
>
> This is very interesting metrics, but they tell us nothing about long STW.
> Long STW means we have huge latency during STW and we should find the
> reason and solve it.
>
> 2) So, we're working on new incremental metrics
> - total amount of STW longer than XXX
> - total duration of STW longer than XXX
>
> which shows us JVM/GC health situation.
>
> Is it answer to your question?
>
> On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <[hidden email]>
> wrote:
>
> > Anton,
> >
> > The question is why user may need so precise measurement? I share
> Andrey’s
> > opinion - cannot understand the value.
> >
> > > Tue, Nov 21, 2017 at 19:33, Anton Vinogradov <[hidden email]
> >:
> >
> > > Andrey,
> > >
> > > >  JVM provides sufficient means of detecting a struggling process out
> of
> > > the box.
> > >
> > > Could you point to some articles describing how to detect STW exceeding
> > > some duration using only JVM API?
> > >
> > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> [hidden email]
> > >
> > > wrote:
> > >
> > > > My 2 cents. Don’t do it. JVM provides sufficient means of detecting a
> > > > struggling process out of the box. SRE/Operations teams usually know
> > how
> > > to
> > > > monitor JVMs and can handle killing of such processes themselves.
> > > >
> > > > The feature adds no value, just complexity (and more configuration
> > > > parameters (!) — as if Ignite didn’t have enough of them already).
> > > >
> > > > Regards,
> > > > Andrey
> > > > _____________________________
> > > > From: Denis Magda <[hidden email]>
> > > > Sent: Monday, November 20, 2017 3:10 PM
> > > > Subject: Re: Facility to detect long STW pauses and other system
> > response
> > > > degradations
> > > > To: <[hidden email]>
> > > >
> > > >
> > > > My 2 cents.
> > > >
> > > > 1. Totally for a separate native process that will handle the
> > monitoring
> > > > of an Ignite process. The watchdog process can simply start a JVM
> tool
> > > like
> > > > jstat and parse its GC logs: https://dzone.com/articles/
> > > > how-monitor-java-garbage <https://dzone.com/articles/
> > > > how-monitor-java-garbage>
> > > >
> > > > 2. As for the STW handling, I would make a possible reaction more
> > > generic.
> > > > Let’s define a policy (enumeration) that will define how to deal with
> > an
> > > > unstable node. The events might be as follows - kill a node, restart
> a
> > > > node, trigger a custom script using Runtime.exec or other methods.
> > > >
> > > > What’d you think? Specifically on point 2.
> > > >
> > > > —
> > > > Denis
> > > >
> > > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > > [hidden email]>
> > > > wrote:
> > > > >
> > > > > Yakov,
> > > > >
> > > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > > >
> > > > > We split issue to
> > > > > #1 STW duration metrics
> > > > > #2 External monitoring allows to stop node during STW
> > > > >
> > > > >> Testing GC pause with java thread is
> > > > >> a bit strange and can give info only after GC pause finishes.
> > > > >
> > > > > That's ok since it's #1
> > > > >
> > > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > >> I have tested solution with java-thread and GC logs had contain
> same
> > > > pause
> > > > >> values of thread stopping which was detected by java-thread.
> > > > >>
> > > > >>
> > > > >> My log (contains pauses > 100ms):
> > > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long
> > STW
> > > > >> pause: 507 milliseconds.
> > > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long
> > STW
> > > > >> pause: 5595 milliseconds.
> > > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long
> > STW
> > > > >> pause: 3262 milliseconds.
> > > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long
> > STW
> > > > >> pause: 1737 milliseconds.
> > > > >>
> > > > >> GC log:
> > > > >> gridgain@dell-5580-92zc8h2:~$ cat
> > > > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > > > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which
> > application
> > > > >> threads were stopped: 0,0000845 seconds, Stopping threads took:
> > > > 0,0000246
> > > > >> seconds
> > > > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which
> > application
> > > > >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> > > > 0,0000252
> > > > >> seconds
> > > > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which
> > application
> > > > >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> > > > 0,0000178
> > > > >> seconds // GOT!
> > > > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which
> > application
> > > > >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> > > > 0,0000229
> > > > >> seconds // GOT!
> > > > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which
> > application
> > > > >> threads were stopped: 3,2595700 seconds, Stopping threads took:
> > > > 0,0000223
> > > > >> seconds // GOT!
> > > > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which
> > application
> > > > >> threads were stopped: 1,7337123 seconds, Stopping threads took:
> > > > 0,0000121
> > > > >> seconds // GOT!
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > > > >>
> > > >
> > > >
> > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Facility to detect long STW pauses and other system response degradations

Anton Vinogradov
Vova,

Monitoring systems don't work the way you expect; they LOVE incremental
metrics :)

> 1) When did these events appear?

Monitoring polls the metrics every N seconds.

01.00 got 0, 0
02.00 got 1, 20  -> means between 01.00 and 02.00 there was 1 STW pause with
duration 20
03.00 got 3, 100 -> means between 02.00 and 03.00 there were 2 STW pauses
with total duration 80

A good monitoring system will record these values and show you a graph when
you decide to look at them.
So, at 03.01 you'll see the per-interval deltas "0, 1, 2" and "0, 20, 80".

> 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> pauses of 1 sec and one critical pause of 90s?

So, the previous probe was 0, 0 and we got it a minute ago.
Now we see 10, 100, which means there were 10 STW pauses with a total
duration of 100 over the last minute.

But in case we set the interval to 10 seconds, we'll get
1, 20
4, 20
0, 0
4, 10
1, 50
0, 0

So, precision depends on the check interval, and it's up to the
administration team to choose it.

> May be a kind of sliding window plus min/max values will do better job.

That's the monitoring system's job (e.g. Zabbix).
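
To make the arithmetic above concrete, here is a rough sketch of what such a
poller does with the two incremental counters; the LongSupplier arguments
stand in for whatever JMX attributes end up exposing them (an assumption for
the example, not a proposed API):

import java.util.function.LongSupplier;

public class StwMetricsPoller {
    /**
     * Polls two incremental counters (total pause count and total pause
     * duration) every intervalMs and prints the per-interval deltas,
     * the same values a monitoring system would graph.
     */
    public static void poll(LongSupplier pauseCount, LongSupplier pauseDurationMs,
        long intervalMs) throws InterruptedException {
        long prevCount = pauseCount.getAsLong();
        long prevDuration = pauseDurationMs.getAsLong();

        while (true) {
            Thread.sleep(intervalMs);

            long curCount = pauseCount.getAsLong();
            long curDuration = pauseDurationMs.getAsLong();

            // Delta since the previous probe: how many pauses happened and
            // how much pause time was accumulated during the last interval.
            System.out.println("pauses=" + (curCount - prevCount)
                + ", totalMs=" + (curDuration - prevDuration));

            prevCount = curCount;
            prevDuration = curDuration;
        }
    }
}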


On Wed, Nov 22, 2017 at 4:10 PM, Vladimir Ozerov <[hidden email]>
wrote:

> The question is how administrator should interpret these numbers. Ok, I
> opened JMX console and see that there was 10 long GC events, which took 100
> seconds.
> 1) When did these events appear? Over the last day, which is more or less
> OK, or over the last 10 minutes, so my server is nearly dead?
> 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> pauses of 1 sec and one critical pause of 90s?
>
> May be a kind of sliding window plus min/max values will do better job.
>
> On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <
> [hidden email]>
> wrote:
>
> > Vova,
> >
> > 1) We can gain collections info from GarbageCollectorMXBean
> > But it provides only
> > - collectionCount
> > - collectionTime.
> >
> > This is very interesting metrics, but they tell us nothing about long
> STW.
> > Long STW means we have huge latency during STW and we should find the
> > reason and solve it.
> >
> > 2) So, we're working on new incremental metrics
> > - total amount of STW longer than XXX
> > - total duration of STW longer than XXX
> >
> > which shows us JVM/GC health situation.
> >
> > Is it answer to your question?
> >
> > On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <[hidden email]>
> > wrote:
> >
> > > Anton,
> > >
> > > The question is why user may need so precise measurement? I share
> > Andrey’s
> > > opinion - cannot understand the value.
> > >
> > > Tue, Nov 21, 2017 at 19:33, Anton Vinogradov <
> [hidden email]
> > >:
> > >
> > > > Andrey,
> > > >
> > > > >  JVM provides sufficient means of detecting a struggling process
> out
> > of
> > > > the box.
> > > >
> > > > Could you point to some articles describing how to detect STW
> exceeding
> > > > some duration using only JVM API?
> > > >
> > > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> > [hidden email]
> > > >
> > > > wrote:
> > > >
> > > > > My 2 cents. Don’t do it. JVM provides sufficient means of
> detecting a
> > > > > struggling process out of the box. SRE/Operations teams usually
> know
> > > how
> > > > to
> > > > > monitor JVMs and can handle killing of such processes themselves.
> > > > >
> > > > > The feature adds no value, just complexity (and more configuration
> > > > > parameters (!) — as if Ignite didn’t have enough of them already).
> > > > >
> > > > > Regards,
> > > > > Andrey
> > > > > _____________________________
> > > > > From: Denis Magda <[hidden email]>
> > > > > Sent: Monday, November 20, 2017 3:10 PM
> > > > > Subject: Re: Facility to detect long STW pauses and other system
> > > response
> > > > > degradations
> > > > > To: <[hidden email]>
> > > > >
> > > > >
> > > > > My 2 cents.
> > > > >
> > > > > 1. Totally for a separate native process that will handle the
> > > monitoring
> > > > > of an Ignite process. The watchdog process can simply start a JVM
> > tool
> > > > like
> > > > > jstat and parse its GC logs: https://dzone.com/articles/
> > > > > how-monitor-java-garbage <https://dzone.com/articles/
> > > > > how-monitor-java-garbage>
> > > > >
> > > > > 2. As for the STW handling, I would make a possible reaction more
> > > > generic.
> > > > > Let’s define a policy (enumeration) that will define how to deal
> with
> > > an
> > > > > unstable node. The events might be as follows - kill a node,
> restart
> > a
> > > > > node, trigger a custom script using Runtime.exec or other methods.
> > > > >
> > > > > What’d you think? Specifically on point 2.
> > > > >
> > > > > —
> > > > > Denis
> > > > >
> > > > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > > > [hidden email]>
> > > > > wrote:
> > > > > >
> > > > > > Yakov,
> > > > > >
> > > > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > > > >
> > > > > > We split issue to
> > > > > > #1 STW duration metrics
> > > > > > #2 External monitoring allows to stop node during STW
> > > > > >
> > > > > >> Testing GC pause with java thread is
> > > > > >> a bit strange and can give info only after GC pause finishes.
> > > > > >
> > > > > > That's ok since it's #1
> > > > > >
> > > > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > > > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > >> I have tested solution with java-thread and GC logs had contain
> > same
> > > > > pause
> > > > > >> values of thread stopping which was detected by java-thread.
> > > > > >>
> > > > > >>
> > > > > >> My log (contains pauses > 100ms):
> > > > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too
> long
> > > STW
> > > > > >> pause: 507 milliseconds.
> > > > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too
> long
> > > STW
> > > > > >> pause: 5595 milliseconds.
> > > > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too
> long
> > > STW
> > > > > >> pause: 3262 milliseconds.
> > > > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too
> long
> > > STW
> > > > > >> pause: 1737 milliseconds.
> > > > > >>
> > > > > >> GC log:
> > > > > >> gridgain@dell-5580-92zc8h2:~$ cat
> > > > > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > > > > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which
> > > application
> > > > > >> threads were stopped: 0,0000845 seconds, Stopping threads took:
> > > > > 0,0000246
> > > > > >> seconds
> > > > > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which
> > > application
> > > > > >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> > > > > 0,0000252
> > > > > >> seconds
> > > > > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which
> > > application
> > > > > >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> > > > > 0,0000178
> > > > > >> seconds // GOT!
> > > > > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which
> > > application
> > > > > >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> > > > > 0,0000229
> > > > > >> seconds // GOT!
> > > > > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which
> > > application
> > > > > >> threads were stopped: 3,2595700 seconds, Stopping threads took:
> > > > > 0,0000223
> > > > > >> seconds // GOT!
> > > > > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which
> > > application
> > > > > >> threads were stopped: 1,7337123 seconds, Stopping threads took:
> > > > > 0,0000121
> > > > > >> seconds // GOT!
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.
> com/
> > > > > >>
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Facility to detect long STW pauses and other system response degradations

Vladimir Ozerov
Makes sense.

On Wed, Nov 22, 2017 at 4:54 PM, Anton Vinogradov <[hidden email]>
wrote:

> Vova,
>
> Monitoring systems works not how you expected, they LOVE incremental
> metrics :)
>
> > 1) When did these events appear?
>
> Monitoring gets metrics each N seconds.
>
> 01.00 got 0, 0
> 02.00 got 1, 20 -> means between 02.00 and 01.00 was 1 STW with duration
> 20
> 03.00 got 3, 100  -> means between 03.00 and 02.00 was 2 STW with total
> duration 80
>
> Good monitoring will record this values and show you graph when you decide
> to check this value.
> So, you'll see "0,1,2" and "0,20,80" at 03.01
>
> > 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> > pauses of 1 sec and one critical pause of 90s?
>
> So, previous probe was 0, 0 and we got it minute ago.
> Now we have 10, 100. It means we have 10 STW with total duration 100 last
> minute.
>
> But, it case we set interval to 10 seconds we'll get
> 1, 20
> 4, 20
> 0, 0
> 4, 10
> 1, 50
> 0, 0
>
> So, precision depends on check interval. And it's up to administration team
> what interval to choose.
>
> > May be a kind of sliding window plus min/max values will do better job.
>
> That's the monitoring system job (eg. zabbix)
>
>
> On Wed, Nov 22, 2017 at 4:10 PM, Vladimir Ozerov <[hidden email]>
> wrote:
>
> > The question is how administrator should interpret these numbers. Ok, I
> > opened JMX console and see that there was 10 long GC events, which took
> 100
> > seconds.
> > 1) When did these events appear? Over the last day, which is more or less
> > OK, or over the last 10 minutes, so my server is nearly dead?
> > 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> > pauses of 1 sec and one critical pause of 90s?
> >
> > May be a kind of sliding window plus min/max values will do better job.
> >
> > On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <
> > [hidden email]>
> > wrote:
> >
> > > Vova,
> > >
> > > 1) We can gain collections info from GarbageCollectorMXBean
> > > But it provides only
> > > - collectionCount
> > > - collectionTime.
> > >
> > > This is very interesting metrics, but they tell us nothing about long
> > STW.
> > > Long STW means we have huge latency during STW and we should find the
> > > reason and solve it.
> > >
> > > 2) So, we're working on new incremental metrics
> > > - total amount of STW longer than XXX
> > > - total duration of STW longer than XXX
> > >
> > > which shows us JVM/GC health situation.
> > >
> > > Is it answer to your question?
> > >
> > > On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <[hidden email]
> >
> > > wrote:
> > >
> > > > Anton,
> > > >
> > > > The question is why user may need so precise measurement? I share
> > > Andrey’s
> > > > opinion - cannot understand the value.
> > > >
> > > > Tue, Nov 21, 2017 at 19:33, Anton Vinogradov <
> > [hidden email]
> > > >:
> > > >
> > > > > Andrey,
> > > > >
> > > > > >  JVM provides sufficient means of detecting a struggling process
> > out
> > > of
> > > > > the box.
> > > > >
> > > > > Could you point to some articles describing how to detect STW
> > exceeding
> > > > > some duration using only JVM API?
> > > > >
> > > > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> > > [hidden email]
> > > > >
> > > > > wrote:
> > > > >
> > > > > > My 2 cents. Don’t do it. JVM provides sufficient means of
> > detecting a
> > > > > > struggling process out of the box. SRE/Operations teams usually
> > know
> > > > how
> > > > > to
> > > > > > monitor JVMs and can handle killing of such processes themselves.
> > > > > >
> > > > > > The feature adds no value, just complexity (and more
> configuration
> > > > > > parameters (!) — as if Ignite didn’t have enough of them
> already).
> > > > > >
> > > > > > Regards,
> > > > > > Andrey
> > > > > > _____________________________
> > > > > > From: Denis Magda <[hidden email]>
> > > > > > Sent: Monday, November 20, 2017 3:10 PM
> > > > > > Subject: Re: Facility to detect long STW pauses and other system
> > > > response
> > > > > > degradations
> > > > > > To: <[hidden email]>
> > > > > >
> > > > > >
> > > > > > My 2 cents.
> > > > > >
> > > > > > 1. Totally for a separate native process that will handle the
> > > > monitoring
> > > > > > of an Ignite process. The watchdog process can simply start a JVM
> > > tool
> > > > > like
> > > > > > jstat and parse its GC logs: https://dzone.com/articles/
> > > > > > how-monitor-java-garbage <https://dzone.com/articles/
> > > > > > how-monitor-java-garbage>
> > > > > >
> > > > > > 2. As for the STW handling, I would make a possible reaction more
> > > > > generic.
> > > > > > Let’s define a policy (enumeration) that will define how to deal
> > with
> > > > an
> > > > > > unstable node. The events might be as follows - kill a node,
> > restart
> > > a
> > > > > > node, trigger a custom script using Runtime.exec or other
> methods.
> > > > > >
> > > > > > What’d you think? Specifically on point 2.
> > > > > >
> > > > > > —
> > > > > > Denis
> > > > > >
> > > > > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > > > > [hidden email]>
> > > > > > wrote:
> > > > > > >
> > > > > > > Yakov,
> > > > > > >
> > > > > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > > > > >
> > > > > > > We split issue to
> > > > > > > #1 STW duration metrics
> > > > > > > #2 External monitoring allows to stop node during STW
> > > > > > >
> > > > > > >> Testing GC pause with java thread is
> > > > > > >> a bit strange and can give info only after GC pause finishes.
> > > > > > >
> > > > > > > That's ok since it's #1
> > > > > > >
> > > > > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > > > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> I have tested solution with java-thread and GC logs had
> contain
> > > same
> > > > > > pause
> > > > > > >> values of thread stopping which was detected by java-thread.
> > > > > > >>
> > > > > > >>
> > > > > > >> My log (contains pauses > 100ms):
> > > > > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too
> > long
> > > > STW
> > > > > > >> pause: 507 milliseconds.
> > > > > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too
> > long
> > > > STW
> > > > > > >> pause: 5595 milliseconds.
> > > > > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too
> > long
> > > > STW
> > > > > > >> pause: 3262 milliseconds.
> > > > > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too
> > long
> > > > STW
> > > > > > >> pause: 1737 milliseconds.
> > > > > > >>
> > > > > > >> GC log:
> > > > > > >> gridgain@dell-5580-92zc8h2:~$ cat
> > > > > > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > > > > > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which
> > > > application
> > > > > > >> threads were stopped: 0,0000845 seconds, Stopping threads
> took:
> > > > > > 0,0000246
> > > > > > >> seconds
> > > > > > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which
> > > > application
> > > > > > >> threads were stopped: 0,0001072 seconds, Stopping threads
> took:
> > > > > > 0,0000252
> > > > > > >> seconds
> > > > > > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which
> > > > application
> > > > > > >> threads were stopped: 0,5001082 seconds, Stopping threads
> took:
> > > > > > 0,0000178
> > > > > > >> seconds // GOT!
> > > > > > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which
> > > > application
> > > > > > >> threads were stopped: 5,5856603 seconds, Stopping threads
> took:
> > > > > > 0,0000229
> > > > > > >> seconds // GOT!
> > > > > > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which
> > > > application
> > > > > > >> threads were stopped: 3,2595700 seconds, Stopping threads
> took:
> > > > > > 0,0000223
> > > > > > >> seconds // GOT!
> > > > > > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which
> > > > application
> > > > > > >> threads were stopped: 1,7337123 seconds, Stopping threads
> took:
> > > > > > 0,0000121
> > > > > > >> seconds // GOT!
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.
> > com/
> > > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>