Unknown known issue on cache rebalancing delayed

Unknown known issue on cache rebalancing delayed

Roman Shtykh
Igniters,
I found the comment "Known issue, possible deadlock in case of low priority cache rebalancing delayed" in GridCacheRebalancingSyncSelfTest#getConfiguration. Could you please explain when using a rebalance delay can be an issue, and why?

-- Roman

Re: Unknown known issue on cache rebalancing delayed

Mmuzaf
Hello Roman,

Have you run into a real issue with delayed rebalance, or are you asking
out of personal interest?
If it is a real issue, please share the details and we will try to help.

As for this comment, I don't think it is still relevant. That change was
made in 2015, and much has changed in the rebalance process since then.
I've uncommented the commented-out line, rechecked with that cache
configuration, and haven't seen any failed tests or issues.

Probably the problem was that a cache in SYNC mode does not start until it
has loaded all data from other nodes. But delayed rebalance currently works
the same way as IgniteCache#rebalance(), so you can set `setRebalanceDelay`
to `-1` and call it manually to check.

In reply to Roman Shtykh's message of Mon, 20 Aug 2018, 11:19.

--
Maxim Muzafarov

Re: Unknown known issue on cache rebalancing delayed

Roman Shtykh
Hi Maxim,

I have some issues on a cluster with rebalance delay enabled, but I need to investigate further; if I find they are related, I'll share. I just wanted confirmation from someone working on rebalancing that this is no longer an issue. We should remove that comment then; it looks scary :)

--
Roman Shtykh

In reply to Maxim Muzafarov's message of Tuesday, August 21, 2018, 12:49:00 a.m. GMT+9.

Re: Unknown known issue on cache rebalancing delayed

Mmuzaf
Roman,

I recently worked on rebalance improvements and haven't found any problems
with delayed cache rebalance.
I agree with you: let's uncomment this and remove the scary comment. Will
you create a ticket for it?

If any problems appear, we can easily detect a deadlock with the newly
configured `FailureHandler`.
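
For illustration only: the sketch below is not Ignite's `FailureHandler` and makes no claim about how it works. It uses the JDK's `ThreadMXBean` to show the kind of check that makes a monitor deadlock visible in a hung test. The class and method names (`DeadlockProbe`, `awaitDeadlock`, `startDeadlockedPair`) are hypothetical, and the two-thread deadlock is constructed on purpose for the demo.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockProbe {
    /** Number of threads the JVM currently reports as deadlocked on object monitors. */
    public static int deadlockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when nothing is deadlocked
        return ids == null ? 0 : ids.length;
    }

    /** Polls until a deadlock is reported or the timeout expires; returns the last count. */
    public static int awaitDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int count;
        while ((count = deadlockedThreadCount()) == 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return count;
    }

    /** Demo only: starts two daemon threads that take two locks in opposite order. */
    public static void startDeadlockedPair() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        Thread t1 = new Thread(() -> lockBoth(a, b, bothHoldFirst), "probe-1");
        Thread t2 = new Thread(() -> lockBoth(b, a, bothHoldFirst), "probe-2");
        t1.setDaemon(true); // daemon threads do not keep the JVM alive
        t2.setDaemon(true);
        t1.start();
        t2.start();
    }

    private static void lockBoth(Object first, Object second, CountDownLatch bothHoldFirst) {
        synchronized (first) {
            bothHoldFirst.countDown();
            try {
                bothHoldFirst.await(); // wait until the other thread holds its first lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached: the other thread holds this monitor.
            }
        }
    }

    public static void main(String[] args) {
        startDeadlockedPair();
        System.out.println("deadlocked threads: " + awaitDeadlock(5_000));
    }
}
```

A check like this can be polled from a watchdog thread while the rebalance tests run, so that a hang is reported as a deadlock instead of a plain timeout.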

In reply to Roman Shtykh's message of Tue, 21 Aug 2018, 03:49.

--
Maxim Muzafarov

Re: Unknown known issue on cache rebalancing delayed

Anton Vinogradov-2
Roman,

I see you uncommented this line.
I don't remember the details of the deadlock, but I remember it was an
extremely rare case.
I found and "fixed" it a few days before the merge, during a week of 24x7
sanity checks :)

So I propose at least 1_000 runs of these tests before we keep this
uncommented.



In reply to Maxim Muzafarov's message of Tue, 21 Aug 2018, 11:08.

Re: Unknown known issue on cache rebalancing delayed

Roman Shtykh
Anton,
Thank you. I would like to recheck it. How can this (1_000 runs) be done in TC?


In reply to Anton Vinogradov's message of Tuesday, September 4, 2018, 5:42:01 p.m. GMT+9.

Re: Unknown known issue on cache rebalancing delayed

Mmuzaf
Roman, Anton,

I've already created an additional PR [2] and run it on TC [1].
Please follow the results there.

[1] https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_Cache8&tab=buildTypeStatusDiv&branch_IgniteTests24Java8=pull%2F4676%2Fhead
[2] https://github.com/apache/ignite/pull/4676/files


In reply to Roman Shtykh's message of Tue, 4 Sep 2018, 12:46.

--
Maxim Muzafarov

Re: Unknown known issue on cache rebalancing delayed

Anton Vinogradov-2
Maxim,
20 is not 1k :)
Also, you forgot to check GridCacheRebalancingAsyncSelfTest.

I'm not sure we should have exactly 1k runs, but 20 is definitely not
enough.

Roman,
I suggest using IDEA's "run until failure" feature to run the test locally
(on your PC) while you're not using it.
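
Outside the IDE, the same idea can be sketched in plain Java. This is a minimal, hedged example rather than the IDEA feature itself: `RepeatUntilFailure` and `firstFailure` are hypothetical names, and the `Callable` stands in for one execution of the test. Because the suspected failure is a deadlock, each run gets a timeout, and a run that does not finish in time counts as a failure.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RepeatUntilFailure {
    /**
     * Runs the check up to maxRuns times and returns the 1-based index of the
     * first run that returned false, threw, or exceeded the timeout.
     * Returns -1 if every run passed.
     */
    public static int firstFailure(Callable<Boolean> check, int maxRuns, long timeoutMillis) {
        // Daemon worker thread: a hung run must not keep the JVM alive.
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "repeat-runner");
            t.setDaemon(true);
            return t;
        });
        try {
            for (int run = 1; run <= maxRuns; run++) {
                Future<Boolean> result = exec.submit(check);
                try {
                    if (!result.get(timeoutMillis, TimeUnit.MILLISECONDS))
                        return run;
                } catch (TimeoutException | ExecutionException e) {
                    return run; // the run hung or threw an exception
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return run;
                }
            }
            return -1;
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a flaky test: fails on its 7th execution.
        int[] calls = {0};
        int failedAt = firstFailure(() -> ++calls[0] != 7, 1_000, 60_000);
        System.out.println("first failure at run " + failedAt);
    }
}
```

The loop stops at the first failing run, so the state that produced the failure (logs, thread dumps) is not overwritten by later runs.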

In reply to Maxim Muzafarov's message of Tue, 4 Sep 2018, 12:51.

Re: Unknown known issue on cache rebalancing delayed

Mmuzaf
Anton,

I agree that 20 runs are not enough. I've checked a single run of the
test class: it takes ~7 min per execution.
The total execution timeout of CacheSuite8 is 210 min, so we can perform at
most 30 executions of the class within this suite. Our strategy here is to
run the class 20 times within a single suite run and put 50 such runs into
the TC queue. In total that is ~7000 min, or about 5 days.

I'm not sure we should perform exactly 1000 executions; hopefully we will
stop adding new tasks to the queue at some point.
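
The arithmetic behind these numbers, as a small sketch. The inputs (7 min per execution, 210 min suite timeout, 20 executions per suite run, 50 queued runs) come from the message above; the class and method names are hypothetical, and the day count assumes the queued runs execute one after another.

```java
public class RunBudget {
    /** Whole executions of the test class that fit into one suite timeout. */
    public static long maxRunsPerSuite(long suiteTimeoutMin, long minutesPerRun) {
        return suiteTimeoutMin / minutesPerRun;
    }

    /** Total minutes needed for all queued suite runs. */
    public static long totalMinutes(long runsPerSuite, long queuedSuiteRuns, long minutesPerRun) {
        return runsPerSuite * queuedSuiteRuns * minutesPerRun;
    }

    /** Converts minutes to days. */
    public static double toDays(long minutes) {
        return minutes / (60.0 * 24.0);
    }

    public static void main(String[] args) {
        long minutesPerRun = 7;     // ~7 min per execution (from the message)
        long suiteTimeoutMin = 210; // CacheSuite8 timeout (from the message)
        long runsPerSuite = 20;     // executions inside one suite run
        long queuedSuiteRuns = 50;  // suite runs put into the TC queue

        long total = totalMinutes(runsPerSuite, queuedSuiteRuns, minutesPerRun);
        System.out.println("max runs per suite: " + maxRunsPerSuite(suiteTimeoutMin, minutesPerRun));
        System.out.println("executions: " + runsPerSuite * queuedSuiteRuns);
        System.out.println("total minutes: " + total);
        System.out.printf("total days if run back to back: %.2f%n", toDays(total));
    }
}
```

With these inputs, 210 / 7 gives 30 executions per suite at most, 20 x 50 gives 1000 executions, and 1000 x 7 = 7000 min is roughly 4.9 days. Twenty executions per suite run take about 140 min, which leaves headroom under the 210 min timeout.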

In reply to Anton Vinogradov's message of Tue, 4 Sep 2018, 12:59.

--
Maxim Muzafarov

Re: Unknown known issue on cache rebalancing delayed

Anton Vinogradov-2
Maxim,

Let's create a branch with 10 runs of the Sync test and 10 runs of the
Async test, then run it 20 times on TC.
I think this should be enough.

In reply to Maxim Muzafarov's message of Tue, 4 Sep 2018, 13:09.

Re: Unknown known issue on cache rebalancing delayed

Roman Shtykh
Anton, Maxim, thanks for following up! It looks like a good enough trade-off.
Sorry, I couldn't keep up with the conversation because of the time zone difference ;)

In reply to Anton Vinogradov's message of Tuesday, September 4, 2018, 7:54:05 p.m. GMT+9.