Fixed deadlock in GridDhtAtomicCache (Alex G. your review is needed)

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Fixed deadlock in GridDhtAtomicCache (Alex G. your review is needed)

Denis Magda
Hi Alex, Igniters,

I've fixed a deadlock in GridDhtAtomicCache that was a reason of frequent hanging of "Cache Restart" test suite.

In short, the deadlock happened because a cache was already stopped but some running threads, that perform cache related operations, keep using invalidated GridCacheContext.
All the details are described here:
https://issues.apache.org/jira/browse/IGNITE-1189 <https://issues.apache.org/jira/browse/IGNITE-1189>

Alex, as one of earlier implementers of this code, please review the changes.

Regards,
Denis
Reply | Threaded
Open this post in threaded view
|

Re: Fixed deadlock in GridDhtAtomicCache (Alex G. your review is needed)

Alexey Goncharuk
The change by itself looks right and can be merged, however I do not think
this is a complete fix. What kind of running threads were using invalidated
cache context? These threads may raise plenty of other exceptions if
invalid context is used. I think the proper solution should block a guard
(I am sure we already have a guard that we can reuse) and wait for all
threads to release this guard before cleaning up the context.

2015-08-04 8:28 GMT-07:00 Denis Magda <[hidden email]>:

> Hi Alex, Igniters,
>
> I've fixed a deadlock in GridDhtAtomicCache that was a reason of frequent
> hanging of "Cache Restart" test suite.
>
> In short, the deadlock happened because a cache was already stopped but
> some running threads, that perform cache related operations, keep using
> invalidated GridCacheContext.
> All the details are described here:
> https://issues.apache.org/jira/browse/IGNITE-1189 <
> https://issues.apache.org/jira/browse/IGNITE-1189>
>
> Alex, as one of earlier implementers of this code, please review the
> changes.
>
> Regards,
> Denis
Reply | Threaded
Open this post in threaded view
|

Re: Fixed deadlock in GridDhtAtomicCache (Alex G. your review is needed)

Denis Magda
Alex, thanks for the review!

Sure, this is just a local fix.
Recently I've detected and fixed several issues in TCP communication SPI
that happened because of invalidated cache context. In addition, Andrey
Gura mentioned that periodically he reproduces hangs in cache get
operations that most likely to happen because of invalidated cache context
as well.

Seems that it's time to fix the situation with invalidated cache context
globally. I'll create a task in JIRA in several days when return from a
short vacation putting extensive details. Then someone from the community
or me will have a chance to makes his/her hands dirty with this :)

As for this deadlock I'll merge that changes in any case because we need to
have them in the code to omit other RuntimeExceptions that may happen
because of any other reason. The threads that led to the deadlock were
threads from partitions supply pool or some internal workers pool.

Regards,
Denis

On 4 авг. 2015 г., at 22:09, Alexey Goncharuk <[hidden email]>
wrote:

The change by itself looks right and can be merged, however I do not think
this is a complete fix. What kind of running threads were using invalidated
cache context? These threads may raise plenty of other exceptions if
invalid context is used. I think the proper solution should block a guard
(I am sure we already have a guard that we can reuse) and wait for all
threads to release this guard before cleaning up the context.

2015-08-04 8:28 GMT-07:00 Denis Magda <[hidden email]>:

Hi Alex, Igniters,

I've fixed a deadlock in GridDhtAtomicCache that was a reason of frequent
hanging of "Cache Restart" test suite.

In short, the deadlock happened because a cache was already stopped but
some running threads, that perform cache related operations, keep using
invalidated GridCacheContext.
All the details are described here:
https://issues.apache.org/jira/browse/IGNITE-1189 <
https://issues.apache.org/jira/browse/IGNITE-1189>

Alex, as one of earlier implementers of this code, please review the
changes.

Regards,
Denis
Reply | Threaded
Open this post in threaded view
|

Re: Fixed deadlock in GridDhtAtomicCache (Alex G. your review is needed)

yzhdanov
Guys, what about not invalidating cache contexts on stop? Let's cleanup on
higher level.

--Yakov

2015-08-04 22:48 GMT+03:00 Denis Magda <[hidden email]>:

> Alex, thanks for the review!
>
> Sure, this is just a local fix.
> Recently I've detected and fixed several issues in TCP communication SPI
> that happened because of invalidated cache context. In addition, Andrey
> Gura mentioned that periodically he reproduces hangs in cache get
> operations that most likely to happen because of invalidated cache context
> as well.
>
> Seems that it's time to fix the situation with invalidated cache context
> globally. I'll create a task in JIRA in several days when return from a
> short vacation putting extensive details. Then someone from the community
> or me will have a chance to makes his/her hands dirty with this :)
>
> As for this deadlock I'll merge that changes in any case because we need to
> have them in the code to omit other RuntimeExceptions that may happen
> because of any other reason. The threads that led to the deadlock were
> threads from partitions supply pool or some internal workers pool.
>
> Regards,
> Denis
>
> On 4 авг. 2015 г., at 22:09, Alexey Goncharuk <[hidden email]>
> wrote:
>
> The change by itself looks right and can be merged, however I do not think
> this is a complete fix. What kind of running threads were using invalidated
> cache context? These threads may raise plenty of other exceptions if
> invalid context is used. I think the proper solution should block a guard
> (I am sure we already have a guard that we can reuse) and wait for all
> threads to release this guard before cleaning up the context.
>
> 2015-08-04 8:28 GMT-07:00 Denis Magda <[hidden email]>:
>
> Hi Alex, Igniters,
>
> I've fixed a deadlock in GridDhtAtomicCache that was a reason of frequent
> hanging of "Cache Restart" test suite.
>
> In short, the deadlock happened because a cache was already stopped but
> some running threads, that perform cache related operations, keep using
> invalidated GridCacheContext.
> All the details are described here:
> https://issues.apache.org/jira/browse/IGNITE-1189 <
> https://issues.apache.org/jira/browse/IGNITE-1189>
>
> Alex, as one of earlier implementers of this code, please review the
> changes.
>
> Regards,
> Denis
>
Reply | Threaded
Open this post in threaded view
|

Re: Fixed deadlock in GridDhtAtomicCache (Alex G. your review is needed)

Denis Magda
What do you mean under the cleanup on a higher level?

Do you consider setting all cache context references to null when
required letting a garbage collector to deallocate context's internals
when it's time for that?

In any case I've created a ticker where we can put all the useful
thoughts/ideas that should help an implementor.
https://issues.apache.org/jira/browse/IGNITE-1221

--
Denis

On 8/5/2015 5:48 PM, Yakov Zhdanov wrote:

> Guys, what about not invalidating cache contexts on stop? Let's cleanup on
> higher level.
>
> --Yakov
>
> 2015-08-04 22:48 GMT+03:00 Denis Magda <[hidden email]>:
>
>> Alex, thanks for the review!
>>
>> Sure, this is just a local fix.
>> Recently I've detected and fixed several issues in TCP communication SPI
>> that happened because of invalidated cache context. In addition, Andrey
>> Gura mentioned that periodically he reproduces hangs in cache get
>> operations that most likely to happen because of invalidated cache context
>> as well.
>>
>> Seems that it's time to fix the situation with invalidated cache context
>> globally. I'll create a task in JIRA in several days when return from a
>> short vacation putting extensive details. Then someone from the community
>> or me will have a chance to makes his/her hands dirty with this :)
>>
>> As for this deadlock I'll merge that changes in any case because we need to
>> have them in the code to omit other RuntimeExceptions that may happen
>> because of any other reason. The threads that led to the deadlock were
>> threads from partitions supply pool or some internal workers pool.
>>
>> Regards,
>> Denis
>>
>> On 4 авг. 2015 г., at 22:09, Alexey Goncharuk <[hidden email]>
>> wrote:
>>
>> The change by itself looks right and can be merged, however I do not think
>> this is a complete fix. What kind of running threads were using invalidated
>> cache context? These threads may raise plenty of other exceptions if
>> invalid context is used. I think the proper solution should block a guard
>> (I am sure we already have a guard that we can reuse) and wait for all
>> threads to release this guard before cleaning up the context.
>>
>> 2015-08-04 8:28 GMT-07:00 Denis Magda <[hidden email]>:
>>
>> Hi Alex, Igniters,
>>
>> I've fixed a deadlock in GridDhtAtomicCache that was a reason of frequent
>> hanging of "Cache Restart" test suite.
>>
>> In short, the deadlock happened because a cache was already stopped but
>> some running threads, that perform cache related operations, keep using
>> invalidated GridCacheContext.
>> All the details are described here:
>> https://issues.apache.org/jira/browse/IGNITE-1189 <
>> https://issues.apache.org/jira/browse/IGNITE-1189>
>>
>> Alex, as one of earlier implementers of this code, please review the
>> changes.
>>
>> Regards,
>> Denis
>>

Reply | Threaded
Open this post in threaded view
|

Re: Fixed deadlock in GridDhtAtomicCache (Alex G. your review is needed)

Denis Magda
Andrey Gura,

Could you put the info on the errors you observed with cache read
operations in the ticket below?

--
Denis

On 8/10/2015 3:13 PM, Denis Magda wrote:

> What do you mean under the cleanup on a higher level?
>
> Do you consider setting all cache context references to null when
> required letting a garbage collector to deallocate context's internals
> when it's time for that?
>
> In any case I've created a ticker where we can put all the useful
> thoughts/ideas that should help an implementor.
> https://issues.apache.org/jira/browse/IGNITE-1221
>
> --
> Denis
>
> On 8/5/2015 5:48 PM, Yakov Zhdanov wrote:
>> Guys, what about not invalidating cache contexts on stop? Let's
>> cleanup on
>> higher level.
>>
>> --Yakov
>>
>> 2015-08-04 22:48 GMT+03:00 Denis Magda <[hidden email]>:
>>
>>> Alex, thanks for the review!
>>>
>>> Sure, this is just a local fix.
>>> Recently I've detected and fixed several issues in TCP communication
>>> SPI
>>> that happened because of invalidated cache context. In addition, Andrey
>>> Gura mentioned that periodically he reproduces hangs in cache get
>>> operations that most likely to happen because of invalidated cache
>>> context
>>> as well.
>>>
>>> Seems that it's time to fix the situation with invalidated cache
>>> context
>>> globally. I'll create a task in JIRA in several days when return from a
>>> short vacation putting extensive details. Then someone from the
>>> community
>>> or me will have a chance to makes his/her hands dirty with this :)
>>>
>>> As for this deadlock I'll merge that changes in any case because we
>>> need to
>>> have them in the code to omit other RuntimeExceptions that may happen
>>> because of any other reason. The threads that led to the deadlock were
>>> threads from partitions supply pool or some internal workers pool.
>>>
>>> Regards,
>>> Denis
>>>
>>> On 4 авг. 2015 г., at 22:09, Alexey Goncharuk
>>> <[hidden email]>
>>> wrote:
>>>
>>> The change by itself looks right and can be merged, however I do not
>>> think
>>> this is a complete fix. What kind of running threads were using
>>> invalidated
>>> cache context? These threads may raise plenty of other exceptions if
>>> invalid context is used. I think the proper solution should block a
>>> guard
>>> (I am sure we already have a guard that we can reuse) and wait for all
>>> threads to release this guard before cleaning up the context.
>>>
>>> 2015-08-04 8:28 GMT-07:00 Denis Magda <[hidden email]>:
>>>
>>> Hi Alex, Igniters,
>>>
>>> I've fixed a deadlock in GridDhtAtomicCache that was a reason of
>>> frequent
>>> hanging of "Cache Restart" test suite.
>>>
>>> In short, the deadlock happened because a cache was already stopped but
>>> some running threads, that perform cache related operations, keep using
>>> invalidated GridCacheContext.
>>> All the details are described here:
>>> https://issues.apache.org/jira/browse/IGNITE-1189 <
>>> https://issues.apache.org/jira/browse/IGNITE-1189>
>>>
>>> Alex, as one of earlier implementers of this code, please review the
>>> changes.
>>>
>>> Regards,
>>> Denis
>>>
>

Reply | Threaded
Open this post in threaded view
|

Re: Fixed deadlock in GridDhtAtomicCache (Alex G. your review is needed)

Andrey Gura
Denis,

https://issues.apache.org/jira/browse/IGNITE-1135 can be related with this
problem (linked with https://issues.apache.org/jira/browse/IGNITE-1221).


On Mon, Aug 10, 2015 at 3:16 PM, Denis Magda <[hidden email]> wrote:

> Andrey Gura,
>
> Could you put the info on the errors you observed with cache read
> operations in the ticket below?
>
> --
> Denis
>
>
> On 8/10/2015 3:13 PM, Denis Magda wrote:
>
>> What do you mean under the cleanup on a higher level?
>>
>> Do you consider setting all cache context references to null when
>> required letting a garbage collector to deallocate context's internals when
>> it's time for that?
>>
>> In any case I've created a ticker where we can put all the useful
>> thoughts/ideas that should help an implementor.
>> https://issues.apache.org/jira/browse/IGNITE-1221
>>
>> --
>> Denis
>>
>> On 8/5/2015 5:48 PM, Yakov Zhdanov wrote:
>>
>>> Guys, what about not invalidating cache contexts on stop? Let's cleanup
>>> on
>>> higher level.
>>>
>>> --Yakov
>>>
>>> 2015-08-04 22:48 GMT+03:00 Denis Magda <[hidden email]>:
>>>
>>> Alex, thanks for the review!
>>>>
>>>> Sure, this is just a local fix.
>>>> Recently I've detected and fixed several issues in TCP communication SPI
>>>> that happened because of invalidated cache context. In addition, Andrey
>>>> Gura mentioned that periodically he reproduces hangs in cache get
>>>> operations that most likely to happen because of invalidated cache
>>>> context
>>>> as well.
>>>>
>>>> Seems that it's time to fix the situation with invalidated cache context
>>>> globally. I'll create a task in JIRA in several days when return from a
>>>> short vacation putting extensive details. Then someone from the
>>>> community
>>>> or me will have a chance to makes his/her hands dirty with this :)
>>>>
>>>> As for this deadlock I'll merge that changes in any case because we
>>>> need to
>>>> have them in the code to omit other RuntimeExceptions that may happen
>>>> because of any other reason. The threads that led to the deadlock were
>>>> threads from partitions supply pool or some internal workers pool.
>>>>
>>>> Regards,
>>>> Denis
>>>>
>>>> On 4 авг. 2015 г., at 22:09, Alexey Goncharuk <
>>>> [hidden email]>
>>>> wrote:
>>>>
>>>> The change by itself looks right and can be merged, however I do not
>>>> think
>>>> this is a complete fix. What kind of running threads were using
>>>> invalidated
>>>> cache context? These threads may raise plenty of other exceptions if
>>>> invalid context is used. I think the proper solution should block a
>>>> guard
>>>> (I am sure we already have a guard that we can reuse) and wait for all
>>>> threads to release this guard before cleaning up the context.
>>>>
>>>> 2015-08-04 8:28 GMT-07:00 Denis Magda <[hidden email]>:
>>>>
>>>> Hi Alex, Igniters,
>>>>
>>>> I've fixed a deadlock in GridDhtAtomicCache that was a reason of
>>>> frequent
>>>> hanging of "Cache Restart" test suite.
>>>>
>>>> In short, the deadlock happened because a cache was already stopped but
>>>> some running threads, that perform cache related operations, keep using
>>>> invalidated GridCacheContext.
>>>> All the details are described here:
>>>> https://issues.apache.org/jira/browse/IGNITE-1189 <
>>>> https://issues.apache.org/jira/browse/IGNITE-1189>
>>>>
>>>> Alex, as one of earlier implementers of this code, please review the
>>>> changes.
>>>>
>>>> Regards,
>>>> Denis
>>>>
>>>>
>>
>


--
Andrey Gura
GridGain Systems, Inc.
www.gridgain.com