Apache Ignite Developers - Legacy Mail Archive

Async cache groups rebalance not started with rebalanceOrder ZERO

Mmuzaf

Async cache groups rebalance not started with rebalanceOrder ZERO

Hello Igniters,

Each cache group has a “rebalance order” property. As the javadoc for
getRebalanceOrder() says: “Note that cache with order {@code 0} does not
participate in ordering. This means that cache with rebalance order {@code
0} will never wait for any other caches. All caches with order {@code 0}
will be rebalanced right away concurrently with each other and ordered
rebalance processes. If not set, cache order is 0, i.e. rebalancing is not
ordered.”

In fact, GridCachePartitionExchangeManager always builds a chain of
rebalancing cache groups to start (even for cache order ZERO):

ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2 -> cacheR5 -> cacheR1.

If one of these groups fails to start, the groups after it will never be
started.
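
To make the difference concrete, here is a minimal sketch that models the two behaviours with plain Java collections. It is not Ignite code: the class name, both method names and the `starts` predicate are hypothetical, and the real logic in GridCachePartitionExchangeManager is more involved. It only illustrates why a chained start stalls after the first failure, while the independent start described by the javadoc does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical model of rebalance start-up; not Ignite code. */
class RebalanceChainSketch {
    /** Chained start: each group is started only if the previous one started. */
    static List<String> startChained(List<String> groups, Predicate<String> starts) {
        List<String> started = new ArrayList<>();

        for (String grp : groups) {
            if (!starts.test(grp))
                break; // The chain is broken: later groups are never started.

            started.add(grp);
        }

        return started;
    }

    /** Independent start: a failed group does not affect the others. */
    static List<String> startIndependently(List<String> groups, Predicate<String> starts) {
        List<String> started = new ArrayList<>();

        for (String grp : groups) {
            if (starts.test(grp))
                started.add(grp);
        }

        return started;
    }

    public static void main(String[] args) {
        List<String> groups = Arrays.asList(
            "ignite-sys-cache", "cacheR", "cacheR3", "cacheR2", "cacheR5", "cacheR1");

        // Assumption for the example: only cacheR3 fails to start.
        Predicate<String> starts = grp -> !grp.equals("cacheR3");

        System.out.println("chained:     " + startChained(groups, starts));
        System.out.println("independent: " + startIndependently(groups, starts));
    }
}
```

With cacheR3 failing, the chained variant starts only ignite-sys-cache and cacheR, while the independent variant starts every group except cacheR3.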

*Question 1*: Should we fix the javadoc description, or file a bug to fix
this rebalance behavior?

[1]
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2630

Dmitriy Pavlov

Re: Async cache groups rebalance not started with rebalanceOrder ZERO

Hi Ilya,

Do you know which is correct: the behaviour or the javadoc?

Sincerely,
Dmitriy Pavlov

Mon, Jul 9, 2018 at 16:43, Maxim Muzafarov <[hidden email]>:

> Hello Igniters,
>
> Each cache group has a “rebalance order” property. As the javadoc for
> getRebalanceOrder() says: “Note that cache with order {@code 0} does not
> participate in ordering. This means that cache with rebalance order {@code
> 0} will never wait for any other caches. All caches with order {@code 0}
> will be rebalanced right away concurrently with each other and ordered
> rebalance processes. If not set, cache order is 0, i.e. rebalancing is not
> ordered.”
>
> In fact, GridCachePartitionExchangeManager always builds a chain of
> rebalancing cache groups to start (even for cache order ZERO):
>
> ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2 -> cacheR5 -> cacheR1.
>
> If one of these groups fails to start, the groups after it will never be
> started.
>
> *Question 1*: Should we fix the javadoc description, or file a bug to fix
> this rebalance behavior?
>
> [1]
>
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2630
>

yzhdanov

Re: Async cache groups rebalance not started with rebalanceOrder ZERO

In reply to this post by Mmuzaf

Maxim, I do not understand the problem. Imagine I do not have any ordering
but rebalancing of some cache fails to start - so in my understanding
overall rebalancing progress becomes blocked. Is that true?

Can you please provide a reproducer for your problem?

--Yakov

Mmuzaf

Re: Async cache groups rebalance not started with rebalanceOrder ZERO

Yakov,

Yes, you're right. The whole rebalancing progress will be stopped.

Actually, rebalancing order doesn't matter; you're right about that too. The
javadoc just describes the idea of how rebalance should work for caches, but
in fact it doesn't work as described. Personally, I'd prefer to start the
rebalance of each cache group asynchronously and independently.

Please look at my reproducer [1].

Scenario:
Cluster with two REPLICATED caches.
Start a new node.
The first cache group's rebalance fails to start (e.g. network issues) -
that's OK.
The second cache group's rebalance will never be started - all further
progress gets stuck (I think rebalance should be started here!).

[1]
https://github.com/Mmuzaf/ignite/blob/rebalance-cancel/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/rebalancing/GridCacheRebalancingCancelSelfTest.java

Fri, Jul 13, 2018 at 17:46, Yakov Zhdanov <[hidden email]>:

> Maxim, I do not understand the problem. Imagine I do not have any ordering
> but rebalancing of some cache fails to start - so in my understanding
> overall rebalancing progress becomes blocked. Is that true?
>
> Can you please provide a reproducer for your problem?
>
> --Yakov

--
--
Maxim Muzafarov

Yakov Zhdanov-2

Re: Async cache groups rebalance not started with rebalanceOrder ZERO

Maxim, I looked at the code you provided. I think we need to add some
timeout validation on the demander side: if no supply message arrives within
30 secs, output a warning to the logs and repeat the demanding process.
This should apply to any demand message throughout the rebalancing process,
not only the first one.

You can use the following message:

Failed to wait for supply message from node within 30 secs [cache=C,
partId=XX]
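
The proposal above can be sketched roughly as follows. This is only an illustration built on a plain `BlockingQueue`, not the Ignite demander: the class name, the `demandWithTimeout` method, its parameters and the supply "message" strings are all hypothetical, and the `maxAttempts` bound is an assumption added so that the sketch terminates.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical demander-side timeout check; not Ignite code. */
class DemandTimeoutSketch {
    /**
     * Sends a demand and waits for a supply message. On timeout it logs a
     * warning and repeats the demand. Returns the supply message, or null if
     * every attempt timed out or the thread was interrupted.
     */
    static String demandWithTimeout(Runnable sendDemand, BlockingQueue<String> supplies,
        long timeoutMs, int maxAttempts, String cache, int partId) {
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                sendDemand.run();

                String supply = supplies.poll(timeoutMs, TimeUnit.MILLISECONDS);

                if (supply != null)
                    return supply;

                System.err.println("Failed to wait for supply message from node within "
                    + timeoutMs + " ms [cache=" + cache + ", partId=" + partId + ']');
            }
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt status.
        }

        return null;
    }

    public static void main(String[] args) {
        BlockingQueue<String> supplies = new LinkedBlockingQueue<>();
        int[] demandsSent = {0};

        // Hypothetical supplier that ignores the first demand and answers the second.
        Runnable sendDemand = () -> {
            if (++demandsSent[0] == 2)
                supplies.add("supply for part 0");
        };

        String res = demandWithTimeout(sendDemand, supplies, 100, 3, "C", 0);

        System.out.println(res + " after " + demandsSent[0] + " demand(s)");
    }
}
```

In this run the first demand times out and produces the warning, and the second demand is answered. In a real implementation the timeout would be 30 secs and the check would apply to every demand message, as suggested above.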

Alex Goncharuk, do you have comments here?

Yakov Zhdanov
www.gridgain.com

2018-07-14 19:45 GMT+03:00 Maxim Muzafarov <[hidden email]>:

> Yakov,
>
> Yes, you're right. The whole rebalancing progress will be stopped.
>
> Actually, rebalancing order doesn't matter; you're right about that too. The
> javadoc just describes the idea of how rebalance should work for caches, but
> in fact it doesn't work as described. Personally, I'd prefer to start the
> rebalance of each cache group asynchronously and independently.
>
> Please look at my reproducer [1].
>
> Scenario:
> Cluster with two REPLICATED caches.
> Start a new node.
> The first cache group's rebalance fails to start (e.g. network issues) -
> that's OK.
> The second cache group's rebalance will never be started - all further
> progress gets stuck (I think rebalance should be started here!).
>
> [1]
> https://github.com/Mmuzaf/ignite/blob/rebalance-cancel/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/rebalancing/GridCacheRebalancingCancelSelfTest.java
>

Anton Vinogradov-2

Re: Async cache groups rebalance not started with rebalanceOrder ZERO

Maxim,

1) There is a typo in the javadoc, feel free to fix it.

2) It's a bad idea to rebalance more than one cache simultaneously.
- It's hard to determine the cause of an error in that case when (not "if",
but "when" :) ) we hit an issue in prod (the 100+ caches case).
- We should keep the rebalance load limited.
Rebalance should not cause thousands of messages per second; this will lead
to cluster death.
rebalanceThreadPoolSize(), rebalanceBatchSize() and
rebalanceBatchesPrefetchCount() give us a guarantee of limited but proper
load.

3) The correct fix for the situation you described is to restart rebalancing
(chained) for both caches on timeout.
And that's what we'll get once the cluster detects that the node has IO
issues and starts a new topology without it.

So it seems only javadoc fixes are required.
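
For reference, the throttling settings named in point 2 can be set as in the configuration sketch below. It assumes an Ignite 2.x release from around the time of this thread on the classpath, where the batch settings are exposed on CacheConfiguration; later releases may expose them elsewhere. The cache name, key and value types, and every numeric value are purely illustrative, not recommendations.

```java
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

/** Illustrative configuration fragment; all values are assumptions. */
class RebalanceThrottlingConfig {
    static IgniteConfiguration configure() {
        CacheConfiguration<Integer, String> ccfg = new CacheConfiguration<>("cacheR");

        ccfg.setCacheMode(CacheMode.REPLICATED);
        ccfg.setRebalanceMode(CacheRebalanceMode.ASYNC);
        ccfg.setRebalanceOrder(0);                 // Default: does not take part in ordering.
        ccfg.setRebalanceBatchSize(512 * 1024);    // Size in bytes of one supply batch.
        ccfg.setRebalanceBatchesPrefetchCount(2);  // Batches a supplier prepares in advance.

        IgniteConfiguration cfg = new IgniteConfiguration();

        cfg.setRebalanceThreadPoolSize(2);         // Threads dedicated to rebalancing.
        cfg.setCacheConfiguration(ccfg);

        return cfg;
    }
}
```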

Wed, Jul 18, 2018 at 15:13, Yakov Zhdanov <[hidden email]>:

> Maxim, I checked and it seems that the send retry count is used only in the
> cache IO manager, and the usage is semantically very far from what I suggest.
> The resend count limits the number of attempts, while I meant a successful
> send but possible problems on the supplier side.
>
> --Yakov
>
> 2018-07-17 19:01 GMT+03:00 Maxim Muzafarov <[hidden email]>:
>
> > Yakov,
> >
> > But we already have DFLT_SEND_RETRY_CNT and DFLT_SEND_RETRY_DELAY for
> > configuring our CommunicationSPI behavior. What if a user configures these
> > parameters their own way and then sees a lot of WARN messages in the log
> > that make no sense?
> >
> > Maybe we could use GridCachePartitionExchangeManager#forceRebalance (or
> > maybe forceReassign) if rebalance fails after all those retries. What do
> > you think?
> >
> >
> >
> > Mon, Jul 16, 2018 at 21:12, Yakov Zhdanov <[hidden email]>:
> >
> > > Maxim, I looked at the code you provided. I think we need to add some
> > > timeout validation on the demander side: if no supply message arrives
> > > within 30 secs, output a warning to the logs and repeat the demanding
> > > process. This should apply to any demand message throughout the
> > > rebalancing process, not only the first one.
> > >
> > > You can use the following message:
> > >
> > > Failed to wait for supply message from node within 30 secs [cache=C,
> > > partId=XX]
> > >
> > > Alex Goncharuk, do you have comments here?
> > >
> > > Yakov Zhdanov
> > > www.gridgain.com