Partition map exchange metrics


Nikita Amelchev
Hello, Igniters.

I suggest adding some useful metrics for the partition map exchange
(PME). For now, the duration of PME stages is available only in log files
and cannot be obtained using JMX or other external tools. [1]

I made a list of local node metrics that help to understand the
actual status of the current PME:

1. initialVersion. Topology version that initiated the exchange.
2. initTime. Time when PME started.
3. initEvent. Event that triggered PME.
4. partitionReleaseTime. Time when the node finished waiting for all
updates and transactions on the previous topology.
5. sendSingleMessageTime. Time when the node sent its single message.
6. recieveFullMessageTime. Time when the node received the full message.
7. finishTime. Time when PME ended.

When a new PME starts, all these metrics are reset.

These metrics help to understand:
- how long the PME took (current or previous).
- how long the wait for all updates to complete took.
- which node is blocking PME (has not sent its single message).
- what triggered PME.
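
To illustrate how the listed values could answer these questions, here is a minimal sketch. The class PmeStageSnapshot and its methods are hypothetical and are not part of Ignite; only the field names come from the list above. It also assumes that the partition release wait is measured from initTime, which the proposal does not state.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical holder for the proposed local-node PME metrics.
 * Field names follow the list above; the class itself is not Ignite API.
 */
final class PmeStageSnapshot {
    final String initialVersion;          // topology version that initiated the exchange
    final String initEvent;               // event that triggered PME
    final Instant initTime;               // PME start
    final Instant partitionReleaseTime;   // null until this stage is reached
    final Instant sendSingleMessageTime;  // null until the single message is sent
    final Instant recieveFullMessageTime; // spelled as in the proposal; null until received
    final Instant finishTime;             // null while PME is still running

    PmeStageSnapshot(String initialVersion, String initEvent, Instant initTime,
                     Instant partitionReleaseTime, Instant sendSingleMessageTime,
                     Instant recieveFullMessageTime, Instant finishTime) {
        this.initialVersion = initialVersion;
        this.initEvent = initEvent;
        this.initTime = initTime;
        this.partitionReleaseTime = partitionReleaseTime;
        this.sendSingleMessageTime = sendSingleMessageTime;
        this.recieveFullMessageTime = recieveFullMessageTime;
        this.finishTime = finishTime;
    }

    /** How long PME took: up to finishTime, or up to 'now' if it is still running. */
    Duration pmeDuration(Instant now) {
        return Duration.between(initTime, finishTime != null ? finishTime : now);
    }

    /** Assumption: the wait for updates on the previous topology starts at initTime. */
    Duration partitionReleaseWait(Instant now) {
        return Duration.between(initTime, partitionReleaseTime != null ? partitionReleaseTime : now);
    }

    /** False means this node has not sent its single message yet and may be holding PME up. */
    boolean singleMessageSent() {
        return sendSingleMessageTime != null;
    }
}
```

A monitoring tool could read such a snapshot from every node and flag the nodes for which singleMessageSent() is still false.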

Thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-11961

--
Best wishes,
Amelchev Nikita

Re: Partition map exchange metrics

Nikolay Izhikov-2
+1.

Nikita, please go ahead.


On Tue, Jul 16, 2019 at 11:45, Nikita Amelchev <[hidden email]> wrote:


Re: Partition map exchange metrics

Anton Vinogradov-2
Nikita,

It looks like all we need right now is one simple metric: are operations blocked?
Just true or false.
Let's start with this.
All the other metrics can be extracted from the logs for now and can be implemented
later.

On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <[hidden email]>
wrote:


Re: Partition map exchange metrics

Nikolay Izhikov-2
Anton,

Why do we need to postpone the implementation of these metrics?
Right now, implementing a new metric is very simple.

I think we can implement these metrics as a single contribution.

On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:


Re: Partition map exchange metrics

Anton Vinogradov-2
BTW,
I found an existing PME metric: getCurrentPmeDuration().
It seems to show the total PME time, which makes it not very useful.
The goal is to show exactly the blocking period.
When a PME causes no blocking, it's a good PME, and I see no reason to have
monitoring related to it :)

On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <[hidden email]> wrote:


Re: Partition map exchange metrics

Nikolay Izhikov-2
I think the administrator of an Ignite cluster should be able to monitor all Ignite processes, including non-blocking PME.

On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:


Re: Partition map exchange metrics

Nikita Amelchev
Anton, Nikolay,

Thanks for the support.

For now, we have the getCurrentPmeDuration() metric, which does not correctly
show the PME's influence on the cluster. A PME can run without blocking
operations, for example, on client node join/leave events.

I suggest adding a new metric, isOperationsBlockedByPme(). Together, these
metrics will show the influence of the PME on the cluster and on user operations.
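
As a rough sketch of how the two metrics could be combined on the monitoring side: the interface PmeMetricsView and the class PmeImpact below are hypothetical and exist only for illustration; only the two method names come from this thread. The sketch assumes getCurrentPmeDuration() returns milliseconds, and the result is at best an upper bound, since blocking may start later than the PME itself.

```java
/** Hypothetical read-only view of the two metrics named in this thread. */
interface PmeMetricsView {
    /** Assumed to be the duration of the running PME in milliseconds. */
    long getCurrentPmeDuration();

    /** True while cache operations are blocked by the running PME. */
    boolean isOperationsBlockedByPme();
}

final class PmeImpact {
    private PmeImpact() {
    }

    /**
     * Estimates how long operations have been blocked, in milliseconds.
     * Returns 0 for a non-blocking PME (for example, a client node join).
     */
    static long blockedMillisEstimate(PmeMetricsView metrics) {
        return metrics.isOperationsBlockedByPme() ? metrics.getCurrentPmeDuration() : 0L;
    }
}
```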

I have prepared a PR for this (the bot visa is green). [1] Can anyone take a look?

[1] https://issues.apache.org/jira/browse/IGNITE-11961

On Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <[hidden email]> wrote:


--
Best wishes,
Amelchev Nikita

Re: Partition map exchange metrics

Pavel Kovalenko
Hi Nikita,

Thank you for working on this. What do you think if we change the boolean
value of metric to a long value that represents time in milliseconds when
operations were blocked?
Since metrics are no longer exposed only through JMX but are also periodically
exported to some backend, this can give a clearer picture of how much time we wait for
resuming cache operations than an instantaneous boolean indicator.

On Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <[hidden email]> wrote:


Re: Partition map exchange metrics

Nikita Amelchev
Hi Pavel,

This time can already be obtained from the getCurrentPmeDuration and
new isOperationsBlockedByPme metrics.

As an alternative solution, I can rework the recently added
getCurrentPmeDuration metric (not released yet). It seems to be useless
for users in the case of a non-blocking PME.
Let's name it timeSinceOperationsBlocked. It will be the timestamp when
blocking started (the minimal value among cluster nodes) and 0 when blocking
has ended (there is no running PME).
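
A minimal sketch of the "minimal value among cluster nodes" rule, assuming each node reports its own timestamp and reports 0 when it is not blocked. The class BlockedSince is hypothetical, and ignoring zero values when taking the minimum is an assumption that the proposal does not spell out.

```java
import java.util.Collection;

/** Hypothetical helper; not part of Ignite. */
final class BlockedSince {
    private BlockedSince() {
    }

    /**
     * Returns the earliest non-zero "blocked since" timestamp reported by any node,
     * or 0 if no node reports blocked operations.
     */
    static long clusterBlockedSince(Collection<Long> perNodeBlockedSince) {
        long min = 0L;
        for (long ts : perNodeBlockedSince) {
            // Assumption: 0 means "not blocked on this node", so it is skipped.
            if (ts > 0L && (min == 0L || ts < min)) {
                min = ts;
            }
        }
        return min;
    }
}
```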

WDYT?

On Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <[hidden email]> wrote:


--
Best wishes,
Amelchev Nikita

Re: Partition map exchange metrics

Pavel Kovalenko
Nikita,

I think getCurrentPmeDuration doesn't show useful information. The main PME
side effect for end users is the blocking of cache operations, and not all of
the PME time blocks them.
What information does the "timeSinceOperationsBlocked" timestamp give to an
end user? For what analysis can it be used, and how?

On Fri, Jul 19, 2019 at 17:48, Nikita Amelchev <[hidden email]> wrote:


Re: Partition map exchange metrics

Nikita Amelchev
Pavel,

The main purpose of this metric is to show
>> how much time we wait for resuming cache operations

It seems I misunderstood you. Do you mean a timestamp or a duration here?
>> What do you think if we change the boolean value of metric to a long value that represents time in milliseconds when operations were blocked?

If it is a timestamp, this time can be calculated as (currentTime -
timeSinceOperationsBlocked).

A duration would be easier to understand. It would be something like
getCurrentBlockingPmeDuration, but I haven't come up with a better
name yet.
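
The relation between the timestamp and the duration variants can be sketched as follows. The class BlockingPmeTracker and its callback methods are hypothetical; the accessor names mirror the ones floated in this thread, and the clock is injected only so that the example is easy to test. Nothing here reflects the actual Ignite implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical tracker of the blocking period of a PME; not part of Ignite. */
final class BlockingPmeTracker {
    private final LongSupplier clock;                          // current time in ms
    private final AtomicLong blockedSince = new AtomicLong(0L); // 0 means "not blocked"

    BlockingPmeTracker(LongSupplier clock) {
        this.clock = clock;
    }

    /** Called when PME starts blocking operations; keeps the earliest timestamp. */
    void onOperationsBlocked() {
        blockedSince.compareAndSet(0L, clock.getAsLong());
    }

    /** Called when operations resume. */
    void onOperationsUnblocked() {
        blockedSince.set(0L);
    }

    /** Timestamp variant: when blocking started, or 0 if operations are not blocked. */
    long timeSinceOperationsBlocked() {
        return blockedSince.get();
    }

    /** Duration variant: currentTime - timeSinceOperationsBlocked, or 0 if not blocked. */
    long getCurrentBlockingPmeDuration() {
        long since = blockedSince.get();
        return since == 0L ? 0L : clock.getAsLong() - since;
    }
}
```

With the duration variant, a consumer does not need to know the node's clock to interpret the value, which is one reason it may be easier to export to an external backend.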

On Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko <[hidden email]> wrote:

>
> Nikita,
>
> I think getCurrentPmeDuration doesn't show useful information. The main PME side effect for end-users is blocking cache operations. Not all PME time blocks it.
> What information gives to an end-user timestamp of "timeSinceOperationsBlocked"? For what analysis it can be used and how?



--
Best wishes,
Amelchev Nikita

Re: Partition map exchange metrics

Pavel Kovalenko
Nikita,

Yes, I mean a duration, not a timestamp. For the metric name, I suggest
"cacheOperationsBlockingDuration"; I think it represents more clearly what is
blocked during PME.
We can also expose both the timestamp "cacheOperationsBlockingStartTs" and the
duration, to better correlate when cache operations were blocked with
how long the blocking took.
For an instant view (as in a JMX bean), a calculated value like the one you
mentioned can be used.
For metrics that are exported to some backend (IEP-35), a counter can be used.
The counter is incremented by the blocking time after blocking has ended.
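
The combination described above could look roughly like this. It is only
a sketch under stated assumptions: the class and method names are
hypothetical, time is passed in explicitly to keep the example
deterministic, and nothing here reflects the actual Ignite implementation.

```java
/** Hypothetical holder for the three values discussed above; not an Ignite class. */
final class CacheOpsBlockingMetricsSketch {
    /** Time in ms when blocking started, 0 if operations are not blocked. */
    private long blockingStartTs;

    /** Counter: total blocked time in ms over all finished blocking periods. */
    private long totalBlockedMs;

    /** Called when PME starts blocking cache operations. */
    synchronized void onBlockingStarted(long nowMs) {
        if (blockingStartTs == 0L)
            blockingStartTs = nowMs;
    }

    /** Called when blocking ends: the counter grows by the length of the finished period. */
    synchronized void onBlockingEnded(long nowMs) {
        if (blockingStartTs != 0L) {
            totalBlockedMs += Math.max(0L, nowMs - blockingStartTs);
            blockingStartTs = 0L;
        }
    }

    /** Instant view (JMX-style): calculated on read, 0 when not blocked. */
    synchronized long currentBlockingDuration(long nowMs) {
        return blockingStartTs == 0L ? 0L : Math.max(0L, nowMs - blockingStartTs);
    }

    /** Start timestamp of the current blocking period, 0 when not blocked. */
    synchronized long blockingStartTs() {
        return blockingStartTs;
    }

    /** Monotonic counter suitable for periodic export to a backend. */
    synchronized long totalBlockedMs() {
        return totalBlockedMs;
    }
}
```

Because the counter only grows, a backend that samples it periodically can
take the difference between two samples to get the blocked time in that
window, even if a whole blocking period fell between the samples.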

On Fri, Jul 19, 2019 at 19:10, Nikita Amelchev <[hidden email]> wrote:

> Pavel,
>
> The main purpose of this metric is
> >> how much time we wait for resuming cache operations
>
> Seems I misunderstood you. Do you mean timestamp or duration here?
> >> What do you think if we change the boolean value of metric to a long
> value that represents time in milliseconds when operations were blocked?
>
> This time can be calculated as (currentTime -
> timeSinceOperationsBlocked) in case of timestamp.
>
> Duration will be more understandable. It'll be something like
> getCurrentBlockingPmeDuration. But I haven't come up with a better
> name yet.

Re: Partition map exchange metrics

Anton Vinogradov-2
Folks,

What's the reason for counting the duration?
AFAIU, counting durations is a feature of the monitoring system.
Since the monitoring system checks metrics periodically, it will know the
duration from its own log.

On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <[hidden email]> wrote:

> Nikita,
>
> Yes, I mean duration not timestamp. For the metric name, I suggest
> "cacheOperationsBlockingDuration", I think it cleaner represents what is
> blocked during PME.
> We can also combine both timestamp "cacheOperationsBlockingStartTs" and
> duration to have better correlation when cache operations were blocked and
> how much time it's taken.
> For instant view (like in JMX bean) a calculated value as you mentioned
> can be used.
> For metrics are exported to some backend (IEP-35) a counter can be used.
> The counter is incremented by blocking time after blocking has ended.

Re: Partition map exchange metrics

Nikolay Izhikov-2
Anton,

1. The value is exported according to the SPI settings, not at the moment it
changes.

2. Clock synchronisation: if we export the start time, we should also export
the node-local timestamp.
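
To illustrate point 1, here is a small sketch of what a backend can
reconstruct from a boolean that it only sees at export time. The 5-second
export interval, the blocking intervals and all names are assumptions made
up for the example; they are not Ignite defaults or APIs.

```java
/** Hypothetical illustration of sampling a boolean metric; not an Ignite class. */
final class PolledBooleanSketch {
    /**
     * Estimates blocked time the way a polling backend could: a sample that reads
     * true is assumed to cover the whole interval until the next sample.
     */
    static long estimateBlockedMs(long[] sampleTs, boolean[] blocked) {
        long total = 0L;

        for (int i = 0; i + 1 < sampleTs.length; i++) {
            if (blocked[i])
                total += sampleTs[i + 1] - sampleTs[i];
        }

        return total;
    }

    /** Value of the boolean metric at time t for one blocking interval [start, end). */
    static boolean blockedAt(long t, long start, long end) {
        return t >= start && t < end;
    }
}
```

With samples at 0, 5000 and 10000 ms, a blocking period from 1200 to 4800 ms
is never observed, so the estimate is 0 although operations were blocked
for 3600 ms. A period from 4000 to 6000 ms is observed once and estimated
as 5000 ms, although it lasted 2000 ms.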

On Mon, Jul 22, 2019 at 8:33, Anton Vinogradov <[hidden email]> wrote:

> Folks,
>
> What's the reason for duration counting?
> AFAIU, it's a monitoring system feature to count the durations.
> Sine monitoring system checks metrics periodically it will know the
> duration by its own log.

Re: Partition map exchange metrics

Anton Vinogradov-2
Nikolay,

I still see no reason to replace the boolean with a long.

On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <[hidden email]> wrote:

> Anton.
>
> 1. Value exported based on SPI settings, not in the moment it changed.
>
> 2. Clock synchronisation - if we export start time, we should also export
> node local timestamp.
>

Re: Partition map exchange metrics

Nikita Amelchev
Folks,

All previous suggestions have some disadvantages. There can be several
exchanges between two metric updates, and a fast exchange can overwrite
the value of a previous long exchange.

We can introduce a metric of total blocking duration that is
accumulated at the end of each exchange. That way, users will get
accurate information about how long operations were blocked. The cluster
metric will be the maximum of the local node metrics. We also need a
boolean metric that indicates the real-time status. It is needed because
the duration metric is updated only at the end of the exchange.

So I propose to change the current metric, which is not released yet, to
the totalCacheOperationsBlockingDuration metric and to add the
isCacheOperationsBlocked metric.
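
To make the proposal concrete, here is a minimal Java sketch of how the two
metrics could fit together. The class and method names are hypothetical and
are not taken from the Ignite code base; timestamps are passed in explicitly
only to keep the example deterministic.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical holder for the two proposed metrics; not Ignite's actual implementation. */
class PmeBlockingMetrics {
    /** Start of the current blocking period in ms, or -1 if operations are not blocked. */
    private volatile long blockingStartMs = -1;

    /** Sum of all finished blocking periods, in ms. */
    private final AtomicLong totalBlockingMs = new AtomicLong();

    /** Called when an exchange starts blocking cache operations. */
    void onBlockingStarted(long nowMs) {
        blockingStartMs = nowMs;
    }

    /** Called at the end of the exchange: the total is updated only here. */
    void onBlockingFinished(long nowMs) {
        long start = blockingStartMs;
        if (start >= 0) {
            totalBlockingMs.addAndGet(nowMs - start);
            blockingStartMs = -1;
        }
    }

    /** Real-time status, needed because the total lags until the exchange ends. */
    boolean isCacheOperationsBlocked() {
        return blockingStartMs >= 0;
    }

    long totalCacheOperationsBlockingDuration() {
        return totalBlockingMs.get();
    }

    /** Cluster-wide value as the maximum of the local node values. */
    static long clusterValue(long... localTotals) {
        long max = 0;
        for (long total : localTotals)
            max = Math.max(max, total);
        return max;
    }
}
```

Because the total only grows, a fast exchange can no longer hide a previous
long one between two metric exports.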

WDYT?

Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <[hidden email]>:

>
> Nikolay,
>
> Still see no reason to replace boolean with long.
>



--
Best wishes,
Amelchev Nikita

Re: Partition map exchange metrics

Pavel Kovalenko
Nikita,

I agree with the total blocking duration metric, but
I still don't understand why the instant value indicating that operations are
blocked should be a boolean.
The duration since blocking started looks more appropriate and useful.
It gives more information while the semantics stay the same.
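
As a rough illustration of what I mean, here is a small Java sketch of such an
instant metric. The names are hypothetical (they are not existing Ignite
APIs), and the clock value is passed in as a parameter so the behaviour is
easy to check.

```java
/** Hypothetical instant metric: time since blocking started, 0 if not blocked. */
class CurrentBlockingDuration {
    /** Start of the current blocking period in ms, or -1 if operations are not blocked. */
    private volatile long blockingStartMs = -1;

    void onBlockingStarted(long nowMs) {
        blockingStartMs = nowMs;
    }

    void onBlockingFinished() {
        blockingStartMs = -1;
    }

    /** Returns how long operations have been blocked so far, or 0 when nothing is blocked. */
    long currentBlockingDurationMs(long nowMs) {
        long start = blockingStartMs;
        return start < 0 ? 0 : Math.max(0, nowMs - start);
    }
}
```

One caveat with this sketch: a value read in the very same millisecond in
which blocking starts is also 0, so an implementation that must distinguish
that case would need a different "not blocked" marker.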



Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <[hidden email]>:

> Folks,
>
> All previous suggestions have some disadvantages. It can be several
> exchanges between two metric updates and fast exchange can rewrite
> previous long exchange.
>
> We can introduce a metric of total blocking duration that will
> accumulate at the end of the exchange. So, users will get actual
> information about how long operations were blocked. Cluster metric
> will be a maximum of local nodes metrics. And we need a boolean metric
> that will indicate realtime status. It needs because of duration
> metric updates at the end of the exchange.
>
> So I propose to change the current metric that not released to the
> totalCacheOperationsBlockingDuration metric and to add the
> isCacheOperationsBlocked metric.
>
> WDYT?
>

Re: Partition map exchange metrics

Ivan Rakov
Folks, let me step in.

Nikita, thanks for your suggestions!

> 1. initialVersion. Topology version that initiates the exchange.
> 2. initTime. Time PME was started.
> 3. initEvent. Event that triggered PME.
> 4. partitionReleaseTime. Time when a node has finished waiting for all
> updates and translations on a previous topology.
> 5. sendSingleMessageTime. Time when a node sent a single message.
> 6. recieveFullMessageTime. Time when a node received a full message.
> 7. finishTime. Time PME was ended.
>
> When new PME started all these metrics resets.
Every metric from Nikita's list looks useful and simple to implement.
I think it would be better to change the format of metrics 4, 5, 6 and
7 a bit: we can keep only the difference between the time of the previous
event and the time of the corresponding event. Such metrics would be easier
to interpret: they answer specific questions such as "how much time did
partition release take?" or "how much time did waiting for the end of the
distributed phase take?". Also, if the results of 4, 5, 6 and 7 are exported
to a monitoring system, graphs will show how the durations of the different
stages change from one PME to another.
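
To illustrate the suggested format, here is a short Java sketch that turns the
timestamps of metrics 2, 4, 5, 6 and 7 into per-stage durations. The class,
method and key names are made up for this example and are not part of Ignite.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: converts PME stage timestamps into per-stage durations. */
class PmeStageDurations {
    /** Each duration is the difference between a stage's timestamp and the previous one. */
    static Map<String, Long> fromTimestamps(long initTime,
                                            long partitionReleaseTime,
                                            long sendSingleMessageTime,
                                            long receiveFullMessageTime,
                                            long finishTime) {
        Map<String, Long> durations = new LinkedHashMap<>();
        durations.put("partitionReleaseDuration", partitionReleaseTime - initTime);
        durations.put("sendSingleMessageDuration", sendSingleMessageTime - partitionReleaseTime);
        durations.put("receiveFullMessageDuration", receiveFullMessageTime - sendSingleMessageTime);
        durations.put("finishDuration", finishTime - receiveFullMessageTime);
        return durations;
    }
}
```

The stage durations add up to finishTime - initTime, so the total PME time
stays recoverable from them.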

> When PME cause no blocking, it's a good PME and I see no reason to have
> monitoring related to it
I agree with Anton here. These metrics should be measured only for a true
distributed exchange. Saving results for client leave/join PMEs will
just complicate monitoring.

> I agree with total blocking duration metric but
> I still don't understand why instant value indicating that operations are
> blocked should be boolean.
> Duration time since blocking has started looks more appropriate and useful.
> It gives more information while semantic is left the same.
I totally agree with Pavel here. Both the "accumulated block time" and
the "current PME block time" metrics are useful. The growth of the
accumulated metric over a specific period of time (which should be easy
to check on a monitoring system graph) will show for how long business
operations were blocked in total, and a non-zero current metric will
show that we are experiencing issues right now. A boolean metric "are we
blocked right now" is not needed, as it can obviously be inferred from
"current PME block time".
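
A rough sketch of how the two metrics could fit together, in Java. The class and method names are hypothetical and are not the Ignite implementation; timestamps are passed in explicitly (an assumption made to keep the example deterministic), and the accumulated value is updated only when blocking ends, as proposed earlier in the thread.

```java
/**
 * Illustrative only: "accumulated block time" plus "current block time".
 * All names are hypothetical, not Ignite API.
 */
final class PmeBlockingMetrics {
    /** Sum of all finished blocking periods, in ms. */
    private long totalBlockedMs;

    /** Start of the blocking period in progress, or -1 if nothing is blocked. */
    private long blockingStartMs = -1;

    synchronized void onBlockingStarted(long nowMs) {
        if (blockingStartMs < 0)
            blockingStartMs = nowMs;
    }

    synchronized void onBlockingFinished(long nowMs) {
        if (blockingStartMs >= 0) {
            // The accumulated metric grows only when a blocking period ends.
            totalBlockedMs += nowMs - blockingStartMs;
            blockingStartMs = -1;
        }
    }

    /** Duration of the blocking period in progress, or 0 if nothing is blocked. */
    synchronized long currentBlockingDurationMs(long nowMs) {
        return blockingStartMs < 0 ? 0 : nowMs - blockingStartMs;
    }

    synchronized long totalBlockingDurationMs() {
        return totalBlockedMs;
    }

    /**
     * The boolean is derived from the duration rather than stored separately.
     * Note: it reads false in the very millisecond in which blocking starts.
     */
    synchronized boolean isBlocked(long nowMs) {
        return currentBlockingDurationMs(nowMs) > 0;
    }
}
```

With blocking from 1000 to 1500 ms, a reading at 1300 ms reports a current duration of 300 ms and an accumulated value of 0; after 1500 ms the current duration drops to 0 and the accumulated value becomes 500 ms.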

Best Regards,
Ivan Rakov


Re: Partition map exchange metrics

Anton Vinogradov-2
Folks,

It looks like we're trying to implement "extended debug" instead of
"monitoring".
A real admin should not need to care which phase of PME is in
progress and so on.
The metrics of interest are:
- total blocked time (will be used for real SLA calculation)
- are we blocked right now (shows that we have an SLA degradation right now)
The duration of the current blocking period can easily be derived by any
modern monitoring tool through regular checks.
The first "true" marks the period start, and the precision is determined
by the check frequency.
Anyway, I'm OK with presenting the current metric as a long, where the
long is a duration. I see no reason for it, but OK :)
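
The monitoring-side derivation described above can be sketched roughly as follows, in Java. The class is hypothetical and does not correspond to any particular monitoring tool; it assumes the tool polls a boolean "blocked" flag and records the time of each sample.

```java
/**
 * Illustrative only: estimates the duration of the current blocking period
 * from periodic samples of a boolean flag. Hypothetical, not a real tool's API.
 */
final class BlockedFlagPoller {
    /** Time of the first "true" sample of the current period, or -1 if none. */
    private long periodStartMs = -1;

    /**
     * Records one sample and returns the estimated duration of the current
     * blocking period in ms (0 if the flag is false).
     */
    long onSample(long sampleTimeMs, boolean blocked) {
        if (!blocked) {
            periodStartMs = -1;
            return 0;
        }

        // The first "true" sample marks the start of the period.
        if (periodStartMs < 0)
            periodStartMs = sampleTimeMs;

        return sampleTimeMs - periodStartMs;
    }
}
```

The estimate can lag the real duration by up to one polling interval, and a blocking period that starts and ends between two samples is not seen at all, which is the precision trade-off mentioned above.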

All the other features you mentioned are useful for improving code or
deployment and can (should) be taken from logs at the analysis
phase.


Re[2]: Partition map exchange metrics

Zhenya Stanilovsky
+1 to Anton's points.


>Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <[hidden email]>:
>
>Folks,
>
>It looks like we're trying to implement "extended debug" instead of
>"monitoring".
>It should not be interesting for real admin what phase of PME is in
>progress and so on.
>Interested metrics are
>- total blocked time (will be used for real SLA counting)
>- are we blocked right now (shows we have an SLA degradation right now)
>Duration of the current blocking period can be easily presented using any
>modern monitoring tool by regular checks.
>Initial true will means "period start", precision will be a result of
>checks frequency.
>Anyway, I'm ok to have current metric presented with long, where long is a
>duration, see no reason, but ok :)
>
>All other features you mentioned are useful for code or
>deployment improving and can (should) be taken from logs at the analysis
>phase.
>
>On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < [hidden email] > wrote:
>
>> Folks, let me step in.
>>
>> Nikita, thanks for your suggestions!
>>
>> > 1. initialVersion. Topology version that initiates the exchange.
>> > 2. initTime. Time PME was started.
>> > 3. initEvent. Event that triggered PME.
>> > 4. partitionReleaseTime. Time when a node has finished waiting for all
>> > updates and translations on a previous topology.
>> > 5. sendSingleMessageTime. Time when a node sent a single message.
>> > 6. recieveFullMessageTime. Time when a node received a full message.
>> > 7. finishTime. Time PME was ended.
>> >
>> > When new PME started all these metrics resets.
>> Every metric from Nikita's list looks useful and simple to implement.
>> I think that it would be better to change format of metrics 4, 5, 6 and
>> 7 a bit: we can keep only difference between time of previous event and
>> time of corresponding event. Such metrics would be easier to perceive:
>> they answer to specific questions "how much time did partition release
>> take?" or "how much time did awaiting of distributed phase end take?".
>> Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
>> graphs will show how different stages times change from one PME to another.
>>
>> > When PME cause no blocking, it's a good PME and I see no reason to have
>> > monitoring related to it
>> Agree with Anton here. These metrics should be measured only for true
>> distributed exchange. Saving results for client leave/join PMEs will
>> just complicate monitoring.
>>
>> > I agree with total blocking duration metric but
>> > I still don't understand why instant value indicating that operations are
>> > blocked should be boolean.
>> > Duration time since blocking has started looks more appropriate and
>> useful.
>> > It gives more information while semantic is left the same.
>> Totally agree with Pavel here. Both "accumulated block time" and
>> "current PME block time" metrics are useful. Growth of accumulated
>> metric for specific period of time (should be easy to check via
>> monitoring system graph) will show for how much business operations were
>> blocked in total, and non-zero current metric will show that we are
>> experiencing issues right now. Boolean metric "are we blocked right now"
>> is not needed as it's obviously can be inferred from "current PME block
>> time".
>>
>> Best Regards,
>> Ivan Rakov
>>
>> On 23.07.2019 16:02, Pavel Kovalenko wrote:
>> > Nikita,
>> >
>> > I agree with total blocking duration metric but
>> > I still don't understand why instant value indicating that operations are
>> > blocked should be boolean.
>> > Duration time since blocking has started looks more appropriate and
>> useful.
>> > It gives more information while semantic is left the same.
>> >
>> >
>> >
>> > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < [hidden email] >:
>> >
>> >> Folks,
>> >>
>> >> All previous suggestions have some disadvantages. It can be several
>> >> exchanges between two metric updates and fast exchange can rewrite
>> >> previous long exchange.
>> >>
>> >> We can introduce a metric of total blocking duration that will
>> >> accumulate at the end of the exchange. So, users will get actual
>> >> information about how long operations were blocked. Cluster metric
>> >> will be a maximum of local nodes metrics. And we need a boolean metric
>> >> that will indicate realtime status. It needs because of duration
>> >> metric updates at the end of the exchange.
>> >>
>> >> So I propose to change the current metric that not released to the
>> >> totalCacheOperationsBlockingDuration metric and to add the
>> >> isCacheOperationsBlocked metric.
>> >>
>> >> WDYT?
>> >>
>> >> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < [hidden email] >:
>> >>> Nikolay,
>> >>>
>> >>> Still see no reason to replace boolean with long.
>> >>>
>> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov < [hidden email] >
>> >> wrote:
>> >>>> Anton.
>> >>>>
>> >>>> 1. Value exported based on SPI settings, not in the moment it changed.
>> >>>>
>> >>>> 2. Clock synchronisation - if we export start time, we should also
>> >> export
>> >>>> node local timestamp.
>> >>>>
>> >>>> Mon, 22 Jul 2019, 8:33 Anton Vinogradov < [hidden email] >:
>> >>>>
>> >>>>> Folks,
>> >>>>>
>> >>>>> What's the reason for duration counting?
>> >>>>> AFAIU, it's a monitoring system feature to count the durations.
>> >>>>> Sine monitoring system checks metrics periodically it will know the
>> >>>>> duration by its own log.
>> >>>>>
>> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko < [hidden email] >
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Nikita,
>> >>>>>>
>> >>>>>> Yes, I mean duration not timestamp. For the metric name, I suggest
>> >>>>>> "cacheOperationsBlockingDuration", I think it cleaner represents
>> >> what
>> >>>> is
>> >>>>>> blocked during PME.
>> >>>>>> We can also combine both timestamp
>> >> "cacheOperationsBlockingStartTs" and
>> >>>>>> duration to have better correlation when cache operations were
>> >> blocked
>> >>>>> and
>> >>>>>> how much time it's taken.
>> >>>>>> For instant view (like in JMX bean) a calculated value as you
>> >> mentioned
>> >>>>>> can be used.
>> >>>>>> For metrics are exported to some backend (IEP-35) a counter can be
>> >>>> used.
>> >>>>>> The counter is incremented by blocking time after blocking has
>> >> ended.
>> >>>>>> Fri, 19 Jul 2019 at 19:10, Nikita Amelchev < [hidden email]
>> >>> :
>> >>>>>>> Pavel,
>> >>>>>>>
>> >>>>>>> The main purpose of this metric is
>> >>>>>>>>> how much time we wait for resuming cache operations
>> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
>> >>>>>>>>> What do you think if we change the boolean value of metric to a
>> >>>> long
>> >>>>>>> value that represents time in milliseconds when operations were
>> >>>> blocked?
>> >>>>>>> This time can be calculated as (currentTime -
>> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
>> >>>>>>>
>> >>>>>>> Duration will be more understandable. It'll be something like
>> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
>> >>>>>>> name yet.
>> >>>>>>>
>> >>>>>>> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko < [hidden email]
>> >>> :
>> >>>>>>>> Nikita,
>> >>>>>>>>
>> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The main
>> >>>>>>>> PME side effect for end users is blocking cache operations, and not all
>> >>>>>>>> PME time blocks them.
>> >>>>>>>> What information does the "timeSinceOperationsBlocked" timestamp give
>> >>>>>>>> to an end user? For what analysis can it be used, and how?
>> >>>>>>>> Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <[hidden email]>:
>> >>>>>>>>> Hi Pavel,
>> >>>>>>>>>
>> >>>>>>>>> This time can already be obtained from the getCurrentPmeDuration and
>> >>>>>>>>> new isOperationsBlockedByPme metrics.
>> >>>>>>>>>
>> >>>>>>>>> As an alternative solution, I can rework the recently added
>> >>>>>>>>> getCurrentPmeDuration metric (not released yet). It seems useless for
>> >>>>>>>>> users in the case of non-blocking PME.
>> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
>> >>>>>>>>> blocking started (the minimal value among cluster nodes) and 0 if
>> >>>>>>>>> blocking has ended (there is no running PME).
>> >>>>>>>>>
>> >>>>>>>>> WDYT?
>> >>>>>>>>>
>> >>>>>>>>> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <[hidden email]>:
>> >>>>>>>>>> Hi Nikita,
>> >>>>>>>>>>
>> >>>>>>>>>> Thank you for working on this. What do you think if we change the
>> >>>>>>>>>> boolean value of the metric to a long value that represents the time
>> >>>>>>>>>> in milliseconds when operations were blocked?
>> >>>>>>>>>> Since we have not only JMX and metrics are now periodically exported
>> >>>>>>>>>> to some backend, it can give a clearer picture of how much time we
>> >>>>>>>>>> wait for resuming cache operations than an instant boolean indicator.
>> >>>>>>>>>> Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <[hidden email]>:
>> >>>>>>>>>>> Anton, Nikolay,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for the support.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric, which does not
>> >>>>>>>>>>> show the influence on the cluster correctly. PME can happen without
>> >>>>>>>>>>> blocking operations, for example, on client node join/leave events.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I suggest adding a new metric, isOperationsBlockedByPme(). Together,
>> >>>>>>>>>>> these metrics will show the influence of PME on the cluster and user
>> >>>>>>>>>>> operations.
>> >>>>>>>>>>> I have prepared a PR for this (Bot visa is green). [1] Can anyone
>> >>>>>>>>>>> take a look?
>> >>>>>>>>>>>
>> >>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
>> >>>>>>>>>>>
>> >>>>>>>>>>> Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <[hidden email]>:
>> >>>>>>>>>>>> I think the administrator of an Ignite cluster should be able to
>> >>>>>>>>>>>> monitor all Ignite processes, including non-blocking PME.
>> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>> >>>>>>>>>>>>> BTW,
>> >>>>>>>>>>>>> Found a PME metric: getCurrentPmeDuration().
>> >>>>>>>>>>>>> It seems to show the whole PME time, which is why it is not so useful.
>> >>>>>>>>>>>>> The goal is to show exactly the blocking period.
>> >>>>>>>>>>>>> When PME causes no blocking, it's a good PME and I see no reason to
>> >>>>>>>>>>>>> have monitoring related to it :)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <[hidden email]> wrote:
>> >>>>>>>>>>>>>> Anton.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Why do we need to postpone the implementation of these metrics?
>> >>>>>>>>>>>>>> For now, implementing a new metric is very simple.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
>> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>> >>>>>>>>>>>>>>> Nikita,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Looks like all we need now is one simple metric: are operations
>> >>>>>>>>>>>>>>> blocked?
>> >>>>>>>>>>>>>>> Just a true or false.
>> >>>>>>>>>>>>>>> Let's start from this.
>> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be
>> >>>>>>>>>>>>>>> implemented later.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <[hidden email]> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> +1.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Tue, 16 Jul 2019, 11:45 Nikita Amelchev <[hidden email]>:
>> >>>>>>>>>>>>>>>>> Hello, Igniters.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I suggest to add some useful metrics about the partition map exchange
>> >>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages available only in log files
>> >>>>>>>>>>>>>>>>> and cannot be obtained using JMX or other external tools. [1]
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I made the list of local node metrics that help to understand the
>> >>>>>>>>>>>>>>>>> actual status of current PME:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
>> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
>> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
>> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all
>> >>>>>>>>>>>>>>>>> updates and translations on a previous topology.
>> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
>> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
>> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> When new PME started all these metrics resets.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> These metrics help to understand:
>> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
>> >>>>>>>>>>>>>>>>> - how long awaited for all updates was completed.
>> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single message)
>> >>>>>>>>>>>>>>>>> - what triggered PME.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thoughts?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> >>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>> Best wishes,
>> >>>>>>>>>>>>>>>>> Amelchev Nikita
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Best wishes,
>> >>>>>>>>>>> Amelchev Nikita
>> >>>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best wishes,
>> >>>>>>>>> Amelchev Nikita
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Best wishes,
>> >>>>>>> Amelchev Nikita
>> >>>>>>>
>> >>
>> >>
>> >> --
>> >> Best wishes,
>> >> Amelchev Nikita
>> >>
>>
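
For reference, the timestamp-plus-counter idea discussed in the quoted thread could be sketched roughly as below. This is only an illustration, not the code from the PR: the class name BlockingDurationSketch and all of its methods are hypothetical, and the only names taken from the thread are the two proposed metric names mentioned in the comments. The current duration is calculated on read from the start timestamp (the instant JMX-style view), and the cumulative counter is incremented by the blocking time only after blocking has ended (the value suited for export to a backend).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a blocking-duration metric; not Ignite code. */
class BlockingDurationSketch {
    /** Start timestamp of the current blocking in ms, 0 when nothing is blocked
     *  (the role proposed for "cacheOperationsBlockingStartTs"). */
    private final AtomicLong blockingStartTs = new AtomicLong(0L);

    /** Total blocked time in ms over all finished blockings (the counter for a backend). */
    private final AtomicLong totalBlockedMs = new AtomicLong(0L);

    /** Remembers when blocking started; ignored if blocking is already in progress. */
    void onBlockingStarted(long nowMs) {
        blockingStartTs.compareAndSet(0L, nowMs);
    }

    /** Adds the finished blocking interval to the counter and clears the timestamp. */
    void onBlockingEnded(long nowMs) {
        long start = blockingStartTs.getAndSet(0L);

        if (start != 0L)
            totalBlockedMs.addAndGet(nowMs - start);
    }

    /** Instant view (the role proposed for "cacheOperationsBlockingDuration"):
     *  currentTime minus start timestamp, or 0 when not blocked. */
    long currentBlockingDuration(long nowMs) {
        long start = blockingStartTs.get();

        return start == 0L ? 0L : nowMs - start;
    }

    /** Cumulative counter; it grows only when a blocking period has ended. */
    long totalBlockedMs() {
        return totalBlockedMs.get();
    }
}
```

In real use the callers would pass System.currentTimeMillis(); the time is a parameter here only to keep the sketch easy to check by hand.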


--
Zhenya Stanilovsky