IGNITE-4365: Data grid in deadlock on stop by DataStreamerImpl

Александр Меньшиков
Hello,

I want to work on ticket IGNITE-4365
<https://issues.apache.org/jira/browse/IGNITE-4365>. The problem originates
in DataStreamerImpl.
Some methods use DataStreamerImpl while holding the GridCacheGateway lock,
but DataStreamerImpl#doFlush() contains a "while(true)" loop. If another
thread calls GridCacheGateway#onStopped() at that moment, the application
can hang: one thread keeps spinning in the loop in
DataStreamerImpl#doFlush(), while the other waits to acquire the lock in
GridCacheGateway#onStopped().
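
To make the scenario above concrete, here is a minimal standalone sketch. It is not Ignite code: the class and all names in it are hypothetical, and it assumes the gateway behaves like a read-write lock (operations take the read side, the stop path takes the write side). The never-completing future stands in for whatever doFlush() is waiting on, and the stop path uses a timeout only so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the gateway/streamer interaction; not Ignite code. */
public class GatewayHangSketch {
    /** Plays the role of the gateway lock (assumed to be a read-write lock). */
    private final ReentrantReadWriteLock gateLock = new ReentrantReadWriteLock();

    /** Plays the role of an unfinished streamer future. */
    private final CompletableFuture<Void> pendingFut = new CompletableFuture<>();

    /** Signals that the worker thread has taken the read lock. */
    private final CountDownLatch entered = new CountDownLatch(1);

    /** Cache operation: holds the read lock for the whole "flush". */
    void operationUnderGateway() {
        gateLock.readLock().lock();
        try {
            entered.countDown();
            // Analogue of the while(true) loop: nothing here reacts to a stop request.
            while (!pendingFut.isDone())
                Thread.yield();
        }
        finally {
            gateLock.readLock().unlock();
        }
    }

    /** Stop path: needs the write lock. Returns false if it is not acquired in time. */
    boolean tryStop(long timeoutMs) throws InterruptedException {
        if (!gateLock.writeLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS))
            return false;
        gateLock.writeLock().unlock();
        return true;
    }

    /** Runs the scenario; returns {stop acquired while flushing, stop acquired after flush}. */
    public static boolean[] run() {
        try {
            GatewayHangSketch s = new GatewayHangSketch();
            Thread worker = new Thread(s::operationUnderGateway, "flusher");
            worker.setDaemon(true);
            worker.start();
            s.entered.await();
            boolean during = s.tryStop(200);   // Flusher still holds the read lock.
            s.pendingFut.complete(null);       // Let the flush loop finish.
            worker.join();
            boolean after = s.tryStop(200);    // Read lock has been released.
            return new boolean[] {during, after};
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] res = run();
        System.out.println("stop acquired while flushing: " + res[0]);
        System.out.println("stop acquired after flush:    " + res[1]);
    }
}
```

In this sketch the stop attempt fails for as long as the flush loop runs and succeeds once the future completes. With an untimed lock() call and a future that never completes, both threads would stay where they are indefinitely.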

So I need an expert opinion about DataStreamerImpl#doFlush().
1) Can I just drop the unfinished futures in DataStreamerImpl#doFlush() when
someone calls GridCacheGateway#onStopped()? I could track that by adding a
volatile boolean flag to GridCacheGateway.
2) Or is it better to change how the futures are completed in
DataStreamerImpl#load0(), so that they finish via onDone with an exception
or something like that?
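
The two options can be sketched side by side as follows. This is only an illustration under assumptions: the class, the `stopping` flag, and the methods below are hypothetical names, CompletableFuture stands in for the streamer's internal futures, and `completeExceptionally` plays the role of calling onDone with an exception.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch of the two options; names are illustrative, not Ignite APIs. */
public class FlushStopSketch {
    /** Option 1: flag raised by the stop path (assumed to live in the gateway). */
    private volatile boolean stopping;

    /** Stand-ins for the streamer's unfinished futures. */
    private final List<CompletableFuture<Void>> activeFuts = new CopyOnWriteArrayList<>();

    /** Registers a future that nobody has completed yet. */
    CompletableFuture<Void> addPending() {
        CompletableFuture<Void> fut = new CompletableFuture<>();
        activeFuts.add(fut);
        return fut;
    }

    /** Called by the stop path before it tries to take the lock. */
    void onStopRequested() {
        stopping = true;
    }

    /** Option 1: the flush loop checks the flag and gives up on unfinished futures. */
    boolean flushCheckingFlag() {
        for (CompletableFuture<Void> fut : activeFuts) {
            while (!fut.isDone()) {
                if (stopping)
                    return false; // Dropped: not all futures were completed.
                Thread.yield();
            }
        }
        return true;
    }

    /** Option 2: the stop path fails every unfinished future, so waiting loops exit. */
    void failPending(Throwable cause) {
        for (CompletableFuture<Void> fut : activeFuts)
            fut.completeExceptionally(cause); // No-op for futures that are already done.
    }

    /** Runs both options; returns {flush aborted by flag, future failed and flush exited}. */
    public static boolean[] demo() {
        FlushStopSketch opt1 = new FlushStopSketch();
        opt1.addPending();
        opt1.onStopRequested();
        boolean aborted = !opt1.flushCheckingFlag();

        FlushStopSketch opt2 = new FlushStopSketch();
        CompletableFuture<Void> fut = opt2.addPending();
        opt2.failPending(new IllegalStateException("cache stopped"));
        boolean failedAndExited = fut.isCompletedExceptionally() && opt2.flushCheckingFlag();

        return new boolean[] {aborted, failedAndExited};
    }

    public static void main(String[] args) {
        boolean[] res = demo();
        System.out.println("option 1, flush aborted by flag: " + res[0]);
        System.out.println("option 2, future failed and flush exited: " + res[1]);
    }
}
```

One difference worth weighing: with option 1 the caller of the flush learns nothing about the dropped futures unless the return value is propagated, while with option 2 anyone holding a future sees the failure.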

Methods which use or might use DataStreamerImpl under the lock:

1) GridCacheAdapter#localLoad()
2) GridCacheAdapter#localLoadAndUpdate()
3) GridCacheAdapter#localLoadCache()
4) GridDistributedCacheAdapter.GlobalRemoveAllJob#localExecute() (this is
exactly what happens in the thread dump in the ticket)

Re: IGNITE-4365: Data grid in deadlock on stop by DataStreamerImpl

yzhdanov
Alex, can you please share a test that demonstrates the hang?

--Yakov

2017-06-29 14:27 GMT+03:00 Александр Меньшиков <[hidden email]>:

> Hello,
>
> I want to work on ticket IGNITE-4365
> <https://issues.apache.org/jira/browse/IGNITE-4365>. The problem
> originates in DataStreamerImpl.
> Some methods use DataStreamerImpl while holding the GridCacheGateway lock,
> but DataStreamerImpl#doFlush() contains a "while(true)" loop. If another
> thread calls GridCacheGateway#onStopped() at that moment, the application
> can hang: one thread keeps spinning in the loop in
> DataStreamerImpl#doFlush(), while the other waits to acquire the lock in
> GridCacheGateway#onStopped().
>
> So I need an expert opinion about DataStreamerImpl#doFlush().
> 1) Can I just drop the unfinished futures in DataStreamerImpl#doFlush() when
> someone calls GridCacheGateway#onStopped()? I could track that by adding a
> volatile boolean flag to GridCacheGateway.
> 2) Or is it better to change how the futures are completed in
> DataStreamerImpl#load0(), so that they finish via onDone with an exception
> or something like that?
>
> Methods which use or might use DataStreamerImpl under the lock:
>
> 1) GridCacheAdapter#localLoad()
> 2) GridCacheAdapter#localLoadAndUpdate()
> 3) GridCacheAdapter#localLoadCache()
> 4) GridDistributedCacheAdapter.GlobalRemoveAllJob#localExecute() (this is
> exactly what happens in the thread dump in the ticket)
>

Re: IGNITE-4365: Data grid in deadlock on stop by DataStreamerImpl

Александр Меньшиков
I don't have one. I took all the information from the thread dump which you
attached to the ticket: one thread is stuck in DataStreamerImpl#doFlush()
(called by GridDistributedCacheAdapter.GlobalRemoveAllJob#localExecute()),
and the other in GridCacheGateway#onStopped() (called by
GridCacheProcessor#onExchangeDone()).

I read about the difficulty of reproducing it (Alexey Kuznetsov's first
comment in JIRA) and decided to approach the problem from a different angle.
The code still looks dangerous, so I don't think the problem has resolved
itself.

The thread dump involves 2 tests:
1) GridCacheNearTxForceKeyTest
2) CrossCacheTxRandomOperationsTest

Both passed on a single run.

2017-06-29 15:31 GMT+03:00 Yakov Zhdanov <[hidden email]>:

> Alex, can you please share a test that demonstrates the hang?
>
> --Yakov