Lightweight profiling of messages processing


Alexei Scherbakov
Igniters,

At work I often have to solve performance issues with an Ignite cluster
without having access to the source code of the running user application.

It looks like Ignite has limited capabilities for identifying bottlenecks
without extensive profiling on the server and client side (JFR recordings,
sampling profilers, regular thread dumps, etc.), which is not always
possible.

Even when profiling data is available, it is not always helpful for
determining several types of bottlenecks, for example, when there is
contention on a single key/partition.

I propose to implement a new feature: lightweight profiling of message
processing.

The feature will provide a view of message processing statistics for each
node and for the grid as a whole.

In short, we need to track each message's execution in the executors and
record statistics such as synchronous execution time in the executor thread
and waiting time in the queue.
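The tracking described above could be sketched as a wrapper around each
queued message task. All class and field names below are illustrative, not
actual Ignite internals:

```java
import java.util.concurrent.atomic.LongAdder;

/** Accumulated statistics for one message type (hypothetical holder class). */
class MessageStats {
    final LongAdder count = new LongAdder();
    final LongAdder waitNanos = new LongAdder();
    final LongAdder procNanos = new LongAdder();
}

/** Wraps a message-processing task so the executor records queue wait and processing time. */
class TimedTask implements Runnable {
    private final Runnable task;
    private final MessageStats stats;
    private final long enqueueNanos = System.nanoTime(); // captured when the task is queued

    TimedTask(Runnable task, MessageStats stats) {
        this.task = task;
        this.stats = stats;
    }

    @Override public void run() {
        long startNanos = System.nanoTime();

        stats.waitNanos.add(startNanos - enqueueNanos); // time spent waiting in the queue

        try {
            task.run();
        }
        finally {
            stats.procNanos.add(System.nanoTime() - startNanos); // synchronous processing time
            stats.count.increment();
        }
    }
}
```

A real implementation would keep one `MessageStats` per message class and per
executor pool, but the measurement points are the same.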

Full description:

1. Implement detailed tracking of message queue waiting and actual
processing by executors, split into several time bins. Example of detailed
statistics for each processed message type:


Processing time(%)
                                 Message       Total  Average(ms)     < 1 ms   < 10 ms    < 30ms    < 50ms   < 100ms   < 250ms   < 500ms   < 750ms  < 1000ms  > 1000ms
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                GridNearSingleGetRequest   904311372     0.023000  904240521     57394      7242      3961      1932       229        61        24         4         4
               GridNearSingleGetResponse   340134416     0.041000  340118791     11660      1167       729       901      1001       158         8         1         0
                     GridNearLockRequest    77088689     0.079000   77073458     11945      2299       643       311        31         2         0         0         0
 GridNearAtomicSingleUpdateInvokeRequest    39645752     0.298000   39580914     28222      6469      4638      9870     13414      2087       137         1         0
        GridDhtAtomicSingleUpdateRequest    37636829     0.277000   37579375     23247      5915      4210      8954     12917      2048       162         1         0
     GridDhtAtomicDeferredUpdateResponse    33580198     0.002000   33579805       337        51         3         1         1         0         0         0         0
                GridNearTxPrepareRequest    21667790     0.238000   21078069    580126      1622      1261      2531      3631       496        40         0        14
                GridDhtTxPrepareResponse    20949873     0.316000   17130161   3803105      4615      3162      4489      3721       577        34         1         8
                 GridDhtTxPrepareRequest    20949838     0.501000   16158732   4750217     16183      5735      8472      8994      1353        88        11        53
                 GridDhtTxFinishResponse    13835065     0.007000   13832519      2476        27        28        14         1         0         0         0         0
                  GridDhtTxFinishRequest    13835028     0.547000   12084106   1736789      8971      2340      1792       807       118        41         4        60
                 GridNearTxFinishRequest    13762197     0.725000   11811828   1942499      4441      1400      1201       524        89        34        19       162
               GridDhtAtomicNearResponse     2784422     0.122000    2783393      1022         5         2         0         0         0         0         0         0
                      GridNearGetRequest     2360483     0.484000    2345937     14129       244       101        64         8         0         0         0         0
                     GridNearGetResponse     1984243     0.054000    1981905      2327         8         1         1         1         0         0         0         0
               GridNearTxPrepareResponse      192856     0.153000     192660       188         1         5         1         1         0         0         0         0
                    GridNearLockResponse      192780     0.091000     192667       107         3         0         3         0         0         0         0         0
                GridNearTxFinishResponse         177     0.822000        129        47         1         0         0         0         0         0         0         0
       GridNearAtomicSingleUpdateRequest         124     4.803000         52        53        19         0         0         0         0         0         0         0
            GridNearAtomicUpdateResponse         120     0.448000        110        10         0         0         0         0         0         0         0         0
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          1544912252              1531765132  12965900     59283     28219     40537     45281      6989       568        42       301


Queue waiting time(%)
                                 Message       Total  Average(ms)     < 1 ms   < 10 ms    < 30ms    < 50ms   < 100ms   < 250ms   < 500ms   < 750ms  < 1000ms  > 1000ms
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                GridNearSingleGetRequest   904311372     0.023000  903130400   1051499     33248     18049     34044     38809      4714       212         8       389
               GridNearSingleGetResponse   340134416     0.041000  339708055    419479      3732      1004      1222       737       117         2         1        67
                     GridNearLockRequest    77088689     0.079000   76949792    129666      2003      1624      2585      2531       390        18         2        78
 GridNearAtomicSingleUpdateInvokeRequest    39645752     0.298000   39572801     53913      3949      2960      5487      5583       963        45         5        46
        GridDhtAtomicSingleUpdateRequest    37636829     0.277000   37561043     57294      3721      2736      5339      5716       879        54         6        41
     GridDhtAtomicDeferredUpdateResponse    33580198     0.002000   33512734     45873      4427      3340      6342      6363      1033        50        10        26
                GridNearTxPrepareRequest    21667790     0.238000   21584416     76283      1681      1026      1948      2027       330        31         2        46
                GridDhtTxPrepareResponse    20949873     0.316000   20862632     83578      1214       633       821       817       118        26         8        26
                 GridDhtTxPrepareRequest    20949838     0.501000   20867551     72264      2265      1516      2773      2905       479        38         6        41
                 GridDhtTxFinishResponse    13835065     0.007000   13808274     26487       150        68        55        21         3         1         0         6
                  GridDhtTxFinishRequest    13835028     0.547000   13809380     25142       188       127       102        59         9         2         4        15
                 GridNearTxFinishRequest    13762197     0.725000   13743925     17365       241       171       278       191        16         4         0         6
               GridDhtAtomicNearResponse     2784422     0.122000    2780600      3767        29         9         7         9         1         0         0         0
                      GridNearGetRequest     2360483     0.484000    2355963      3762       194       114       184       228        38         0         0         0
                     GridNearGetResponse     1984243     0.054000    1979404      4721        31         9        49        28         1         0         0         0
               GridNearTxPrepareResponse      192856     0.153000     192852         2         0         1         1         0         0         0         0         0
                    GridNearLockResponse      192780     0.091000     192777         3         0         0         0         0         0         0         0         0
                GridNearTxFinishResponse         177     0.822000        177         0         0         0         0         0         0         0         0         0
       GridNearAtomicSingleUpdateRequest         124     4.803000        124         0         0         0         0         0         0         0         0         0
            GridNearAtomicUpdateResponse         120     0.448000        119         1         0         0         0         0         0         0         0         0
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          1544912252              1542613019   2071099     57073     33387     61237     66024      9091       483        52       787


This will make it possible to understand the cluster load profile and to
detect problems with increased latency in message processing. In general,
all messages should be processed in under 30 ms.
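The fixed time bins in the tables above could be implemented with a small
lock-free histogram. The class below is an illustrative sketch, not Ignite
code; the bin bounds are taken from the table header:

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Histogram with the fixed latency bins used in the report. */
class TimeHistogram {
    /** Upper bounds in milliseconds; the last bin catches everything > 1000 ms. */
    static final long[] BOUNDS_MS = {1, 10, 30, 50, 100, 250, 500, 750, 1000};

    private final AtomicLongArray bins = new AtomicLongArray(BOUNDS_MS.length + 1);

    /** Records one duration into the matching bin. */
    void record(long durationMs) {
        bins.incrementAndGet(binIndex(durationMs));
    }

    /** Returns the index of the first bin whose upper bound exceeds the duration. */
    static int binIndex(long durationMs) {
        for (int i = 0; i < BOUNDS_MS.length; i++)
            if (durationMs < BOUNDS_MS[i])
                return i;

        return BOUNDS_MS.length; // > 1000 ms bin.
    }

    long count(int bin) {
        return bins.get(bin);
    }
}
```

A single `AtomicLongArray` per message type keeps the recording path to one
atomic increment, which matters for the "lightweight" goal.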

2. Log messages which have waited in the queue or have been processed for
too long (against a configurable threshold).

Track the following: message enqueue time, time waited in the queue before
processing started, the message at the head of the queue when waiting
started, queue size when waiting started, total waiting time, total
processing time, queue size after processing, and the message content.
Example:

Slow message: *enqueueTs*=2018-11-27 15:10:22.241, *waitTime*=0.048,
*procTime*=305.186, *messageId*=3a3064a9, *queueSzBefore*=0,
*headMessageId*=null,
*queueSzAfter*=0, *message*=GridNearTxFinishRequest [miniId=1,
mvccSnapshot=null, super=GridDistributedTxFinishRequest
[topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0],
futId=199a3155761-f379f312-ad4b-4181-acc5-0aacb3391f07, threadId=296,
commitVer=null, invalidate=false, commit=true, baseVer=null, txSize=0,
sys=false, plc=2, subjId=dda703a0-69ee-47cf-9b9a-bf3dc9309feb,
taskNameHash=0, flags=32, syncMode=FULL_SYNC, txState=IgniteTxStateImpl
[activeCacheIds=[644280847], recovery=false, mvccEnabled=false,
txMap=HashSet [IgniteTxEntry [key=KeyCacheObjectImpl [part=8, val=8,
hasValBytes=true], cacheId=644280847, txKey=IgniteTxKey
[key=KeyCacheObjectImpl [part=8, val=8, hasValBytes=true],
cacheId=644280847], val=[op=READ, val=null], prevVal=[op=NOOP, val=null],
oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=false,
entry=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8,
hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0,
nodeOrder=0], hash=8, extras=GridCacheObsoleteEntryExtras
[obsoleteVer=GridCacheVersion [topVer=2147483647, order=0, nodeOrder=0]],
flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
[rdrs=ReaderId[] [], part=8, super=], prepared=0, locked=false,
nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
[topVer=154800617, order=1543320621588, nodeOrder=1]], IgniteTxEntry
[key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true],
cacheId=644280847, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=11,
val=11, hasValBytes=true], cacheId=644280847], val=[op=READ, val=null],
prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [],
filtersPassed=false, filtersSet=false, entry=GridCacheMapEntry
[key=KeyCacheObjectImpl [part=11, val=11, hasValBytes=true], val=null,
ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=11,
extras=GridCacheObsoleteEntryExtras [obsoleteVer=GridCacheVersion
[topVer=2147483647, order=0, nodeOrder=0]],
flags=2]GridDistributedCacheEntry [super=]GridDhtCacheEntry
[rdrs=ReaderId[] [], part=11, super=], prepared=0, locked=false,
nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion
[topVer=154800617, order=1543320621588, nodeOrder=1]]]],
super=GridDistributedBaseMessage [ver=GridCacheVersion [topVer=154800617,
order=1543320621587, nodeOrder=4], committedVers=null, rolledbackVers=null,
cnt=0, super=GridCacheIdMessage [cacheId=0]]]]

This will make it possible to understand which type of activity produces
latency issues and possibly to identify the culprit code.
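The threshold check from point 2 could look roughly like this; the class,
field names, and log line layout are illustrative (modeled on the example
above), not the actual implementation:

```java
/** Decides whether a message should be reported as slow and formats the log line. */
class SlowMessageDetector {
    private final long thresholdMs; // configured threshold from point 2

    SlowMessageDetector(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /** A message is slow if either its queue wait or its processing time exceeds the threshold. */
    boolean isSlow(long waitMs, long procMs) {
        return waitMs > thresholdMs || procMs > thresholdMs;
    }

    /** Builds a log line similar to the "Slow message" example. */
    String format(String msgId, long waitMs, long procMs, int queueSzBefore, int queueSzAfter, Object message) {
        return String.format(
            "Slow message: waitTime=%d, procTime=%d, messageId=%s, queueSzBefore=%d, queueSzAfter=%d, message=%s",
            waitMs, procMs, msgId, queueSzBefore, queueSzAfter, message);
    }
}
```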

3. Track all synchronous waits during message processing: checkpoint locks,
segment locks, page locks, cold page reads from disk, etc.
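One possible way to attribute such waits to the message currently being
processed is a thread-local accumulator that blocking code paths add to.
This is an assumption about the design, not the actual implementation:

```java
/** Accumulates time spent in synchronous waits on the current executor thread. */
class WaitTracker {
    private static final ThreadLocal<long[]> WAIT_NANOS = ThreadLocal.withInitial(() -> new long[1]);

    /** Called before a message's processing starts. */
    static void reset() {
        WAIT_NANOS.get()[0] = 0;
    }

    /** Wraps a blocking section, e.g. a checkpoint lock acquisition or a cold page read. */
    static <T> T trackWait(java.util.function.Supplier<T> blockingCall) {
        long start = System.nanoTime();

        try {
            return blockingCall.get();
        }
        finally {
            WAIT_NANOS.get()[0] += System.nanoTime() - start;
        }
    }

    /** Total synchronous wait time accumulated since the last reset. */
    static long waitedNanos() {
        return WAIT_NANOS.get()[0];
    }
}
```

After the message is processed, the accumulated value would be recorded
alongside the processing-time statistics from point 1.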

4. Allow extensions that collect additional statistics for a specific
message type. For example, for *TxFinishRequest we could track the total
number of entries written to each cache and their size in bytes. For
GridNearLockRequest we could track the keys with the longest lock queues to
identify contention.
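Such an extension point could be a per-message-type collector registered
against the message class. The interface and registry below are
hypothetical names, not part of Ignite:

```java
import java.util.HashMap;
import java.util.Map;

/** Pluggable collector of extra statistics for one message type. */
interface MessageStatsExtension<M> {
    /** The message class this extension handles. */
    Class<M> messageType();

    /** Called after a message of the handled type has been processed. */
    void onProcessed(M msg, long procNanos);
}

/** Routes processed messages to the extension registered for their class. */
class ExtensionRegistry {
    private final Map<Class<?>, MessageStatsExtension<?>> exts = new HashMap<>();

    void register(MessageStatsExtension<?> ext) {
        exts.put(ext.messageType(), ext);
    }

    @SuppressWarnings("unchecked")
    <M> void dispatch(M msg, long procNanos) {
        MessageStatsExtension<M> ext = (MessageStatsExtension<M>)exts.get(msg.getClass());

        if (ext != null)
            ext.onProcessed(msg, procNanos);
    }
}
```

An extension for *TxFinishRequest would then sum written entries per cache
inside its `onProcessed` callback.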

5. Implement additional logging, JMX metrics, and a control.sh command for
printing the currently collected profiling data.

6. Implementation should be as lightweight as possible.

Maybe something else.

BTW, I have already started working on this and want to contribute my work
to the community. There is a ticket with a (not yet finished) PR [1]; see
CacheMessageStatsTest and CacheMessageStatsIndexingTest.

[1] https://issues.apache.org/jira/browse/IGNITE-10418

Thoughts?

--

Best regards,
Alexei Scherbakov

Re: Lightweight profiling of messages processing

Alexey Kukushkin
Hi Alexei,

Did you consider general-purpose APM
<https://en.wikipedia.org/wiki/Application_performance_management> tools
like free InspectIT <https://github.com/inspectIT/inspectIT> or commercial
DynaTrace <https://www.dynatrace.com/> or AppDynamics
<https://www.appdynamics.com/>?

Java APM tools do not require writing code or even instrumenting Ignite
binaries: you just attach them as a javaagent
<https://zeroturnaround.com/rebellabs/how-to-inspect-classes-in-your-jvm/> to
the Ignite JVM and then configure "sensors" to track whatever API you want.
APM tools collect metrics from the sensors and provide sophisticated
analysis tools.

DynaTrace claims a 2%-7% overhead (depending on the application), but you
can always detach the tool when you do not need it in production.

I did not try APM for Ignite myself but it might work well.


--
Best regards,
Alexey

Re: Lightweight profiling of messages processing

Vladimir Ozerov
Alexey,

I would say that to implement this feature successfully, we first need to
clearly understand the specific scenarios we want to target, and only then
plan the implementation in small iterations. Please take into account the
following points:
1) Splitting by message type might be too fine-grained for the first
iteration. For example, say we have N different message types processed in
a system pool. Add several messages causing long-running tasks (e.g.
invoke), and dequeue time will grow for all other message types in the same
queue. I think we can start with simpler metrics: per-pool queueing time,
pool throughput, and pool task latencies. This would be enough for a wide
range of use cases. We can expand it in the future if needed.
2) Some IO statistics were already implemented as part of IGNITE-6868 [1].
See TcpCommunicationSpiMBeanImpl. You may find it useful for your task.
3) The public API should be designed upfront. Normally it should include
JMX and system SQL views (e.g. like [2]). JMX is useful for external tools;
SQL views give access to performance data from all platforms at once. This
is critical.
4) It should be possible to enable/disable such metrics at runtime.
Probably the infrastructure from IGNITE-369 [3] can be reused for this
purpose (see GridCacheProcessor.EnableStatisticsFuture).
5) IO access statistics are already under development [4]. The first part
will be merged very soon.
6) Performance effects must be measured extremely carefully in all modes
(in-memory, persistence, background/log_only/fsync), because you are likely
to change very performance-sensitive code paths. We had a lot of
performance issues when implementing IGNITE-6868 [1].
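The simpler per-pool metrics suggested in point 1 (queueing time, pool
throughput, task latency) could look roughly like this; the class and
method names are illustrative, not an Ignite API:

```java
import java.util.concurrent.atomic.LongAdder;

/** Aggregated metrics for one thread pool rather than per message type. */
class PoolMetrics {
    final LongAdder tasksCompleted = new LongAdder(); // throughput numerator
    final LongAdder queueNanos = new LongAdder();     // total time tasks waited in the queue
    final LongAdder execNanos = new LongAdder();      // total task execution time

    /** Called once per completed task with its measured queue wait and run time. */
    void onTaskDone(long queuedNanos, long ranNanos) {
        tasksCompleted.increment();
        queueNanos.add(queuedNanos);
        execNanos.add(ranNanos);
    }

    /** Average queue wait in milliseconds over all completed tasks. */
    double avgQueueMs() {
        long n = tasksCompleted.sum();

        return n == 0 ? 0 : queueNanos.sum() / 1e6 / n;
    }
}
```

One such object per pool (system, striped, query, etc.) would already
reveal which pool is saturated before any per-message breakdown is needed.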

Vladimir.

[1] https://issues.apache.org/jira/browse/IGNITE-6868
[2] https://issues.apache.org/jira/browse/IGNITE-7700
[3] https://issues.apache.org/jira/browse/IGNITE-369
[4]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-27%3A+Page+IO+statistics


Re: Lightweight profiling of messages processing

Vladimir Ozerov
Clarification: p.5 is about buffer pool and disk access statistics, not
network.
