Suggestion to improve deadlock detection

Vladimir Ozerov
Igniters,

We are currently working on transactional SQL, and distributed deadlocks are
a serious problem for us. The current deadlock detection mechanism has
several deficiencies:
1) It transfers keys! That is a no-go for SQL, where we may have millions of
keys.
2) By default we wait for a minute. Way too long, IMO.

What if we change it as follows:
1) Collect the XIDs of all preceding transactions while obtaining a lock
within the current transaction object. This way we always have the list of
TXes we wait for.
2) Define a TX deadlock coordinator node.
3) Periodically (e.g. once per second), iterate over active transactions and
detect ones waiting for a lock for too long (e.g. >2-3 sec). Timeouts could
be adaptive, depending on the workload and the false-positive alarm rate.
4) Send info about those long-running transactions to the coordinator in the
form Map[XID -> List<XID>] (a rough sketch of this exchange follows the
list).
5) Rebuild the global wait-for graph on the coordinator and search it for
deadlocks.
6) Choose a victim and send the problematic wait-for graph to it.
7) The victim collects the necessary info (e.g. keys, SQL statements, thread
IDs, cache IDs, etc.) and throws an exception.
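
As a rough illustration of steps 4-5, with hypothetical names (WaitForReport
and DeadlockCoordinator are not existing Ignite classes):

import java.io.Serializable;
import java.util.*;

/** Per-node report: for each long-waiting TX, the XIDs it waits for. */
class WaitForReport implements Serializable {
    final UUID nodeId;
    final Map<UUID, List<UUID>> waitsFor; // XID -> XIDs of blocking TXes

    WaitForReport(UUID nodeId, Map<UUID, List<UUID>> waitsFor) {
        this.nodeId = nodeId;
        this.waitsFor = waitsFor;
    }
}

/** Coordinator side: merges incoming reports into the global wait-for graph. */
class DeadlockCoordinator {
    final Map<UUID, Set<UUID>> wfg = new HashMap<>();

    void onReport(WaitForReport rep) {
        rep.waitsFor.forEach((xid, blockers) ->
            wfg.computeIfAbsent(xid, k -> new HashSet<>()).addAll(blockers));
    }
}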

Advantages:
1) We ignore short transactions. So if there are tons of short TXes in a
typical OLTP workload, we will never analyze most of them.
2) Only a minimal set of data is sent between nodes, so we can exchange data
often without losing performance.

Thoughts?

Vladimir.

Re: Suggestion to improve deadlock detection

dsetrakyan
Vladimir,

I am not sure I like it, mainly due to a coordinator node doing periodic
checks. For deadlock detection to work effectively, it has to be done
locally on every node. This may require that every tx request carry
information about up to the N previous keys it accessed, but the detection
will happen locally on the destination node.

What do you think?

D.

Re: Suggestion to improve deadlock detection

Vladimir Ozerov
Dima,

What is wrong with the coordinator approach? All it does is analyze a small
number of TXes which have been waiting for locks for too long.

Re: Suggestion to improve deadlock detection

dsetrakyan
How does it know about all the Txs?

D.

Re: Suggestion to improve deadlock detection

Vladimir Ozerov
It doesn’t need all txes. Instead, other nodes will send info about
suspicious txes to it from time to time.

Re: Suggestion to improve deadlock detection

dsetrakyan
I see your point, I think it might work.

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Hi Igniters,

I would like to resume the discussion about a deadlock detector. I will
start with the motivation for further work on the subject. As I see it, the
current implementation (entry point IgniteTxManager.detectDeadlock) starts
detection only after a transaction has timed out. To my mind, this is not
very good from a product usability standpoint. As you know, in a deadlock
situation some keys become unusable for an infinite amount of time.
Currently the only way to work around it is configuring a timeout, but in
practice it can be rather tricky to choose a proper/universal value for it.
So, I see the main point as:

Ability to break deadlocks without a need to configure timeouts explicitly.

I will return soon with some thoughts about implementation. Meanwhile, does
anybody have in mind any other usability points which I am missing? Or are
there any alternative approaches?

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Hi,

Next part as promised. A working item for me is a deadlock detector for MVCC
transactions [1]. The message is structured in 2 parts. The first is an
analysis of the current state of affairs and the possible options; the
second is a proposed option. The first part is going to be not so short, so
some might prefer to skip it.

ANALYSIS
The immediate question is "why can we not use the existing deadlock
detector?". The difference between the classic and MVCC transaction
implementations is the answer. Currently a collection of IgniteTxEntry is
used for detection, but such a collection is not maintained for MVCC
transactions, so it will not work out of the box.
Also, it looks like the current distributed iterative approach cannot be
low-latency in the worst case, because it may issue many network requests
sequentially.
So, what options do we have? Generally, we should choose between centralized
and distributed approaches. By a centralized approach I mean a dedicated
deadlock detector located on a single node. In the centralized approach we
can face difficulties related to failover, as the node running the deadlock
detector can fail. In the distributed approach, extra network messaging
overhead can strike, because different nodes participating in a deadlock can
start detection independently and send redundant messages. I see some
aspects which matter for choosing an implementation. Here they are, with the
approach that is better (roughly speaking) in parentheses:
* Detection latency (centralized).
* Messaging overhead (centralized).
* Failover (distributed).
Also, having a zoo of deadlock detectors does not sound very good. I hope
that it is possible to develop a common solution suitable for both kinds of
transactions. I suggest piloting the new solution with MVCC and then
adapting it for classic transactions.

PROPOSAL
Actually, I propose to start with the centralized algorithm described by
Vladimir at the beginning of the thread. I will try to outline its main
points (a cycle-search sketch follows the list):
1. A single deadlock detector exists in the cluster and maintains the
transaction wait-for graph (WFG).
2. Each cluster node sends and invalidates wait-for edges to the detector.
3. The detector periodically searches for cycles in the WFG and, if a cycle
is found, chooses and aborts a victim transaction.
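
For point 3, the cycle search itself can be a plain DFS over the in-memory
graph. A minimal sketch (the names are made up):

import java.util.*;

class WfgCycleSearch {
    /** Returns a wait-for path ending in a cycle, or an empty list. */
    static List<UUID> findCycle(Map<UUID, Set<UUID>> wfg) {
        Set<UUID> visited = new HashSet<>();
        for (UUID start : wfg.keySet()) {
            Deque<UUID> path = new ArrayDeque<>();
            if (dfs(start, wfg, visited, new HashSet<>(), path))
                return new ArrayList<>(path);
        }
        return Collections.emptyList();
    }

    private static boolean dfs(UUID xid, Map<UUID, Set<UUID>> wfg,
        Set<UUID> visited, Set<UUID> onPath, Deque<UUID> path) {
        if (onPath.contains(xid))
            return true;                  // back edge: a cycle is found
        if (!visited.add(xid))
            return false;                 // already fully explored
        onPath.add(xid);
        path.addLast(xid);
        for (UUID next : wfg.getOrDefault(xid, Collections.emptySet()))
            if (dfs(next, wfg, visited, onPath, path))
                return true;
        onPath.remove(xid);
        path.removeLast();
        return false;
    }
}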

Currently I have one fundamental question: is there a possibility of falsely
detected deadlocks because of concurrent WFG updates?
Of course, there are many points of improvement and optimization, but I
would like to start by discussing the key points.

Please share your thoughts!

[1] https://issues.apache.org/jira/browse/IGNITE-9322

--
Best regards,
Ivan Pavlukhin

Re: Suggestion to improve deadlock detection

Vladimir Ozerov
Ivan,

This is an interesting question. I think we should spend some time on formal
verification of whether this algorithm works or not. Several articles you
may use as a starting point: [1], [2]. From what I understand, Ignite falls
into the "AND" model, and the currently implemented algorithm is a variation
of the "edge-chasing" approach as per Chandy, Misra and Haas [3], which is
*proven to be correct* in that it both detects deadlocks when they are
present and does not produce false positives. But it might be too heavy for
a system under contention.

We need to search for a formal proof of correctness of the proposed
algorithm. This area is already researched thoroughly enough, so we should
be able to find an answer quickly.

Vladimir.

[1] http://www.cse.scu.edu/~jholliday/dd_9_16.htm
[2] https://www.cs.uic.edu/~ajayk/Chapter10.pdf
[3]
https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
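
For reference, the core of a Chandy-Misra-Haas probe in the AND model looks
roughly like this. An illustrative sketch only; none of these names exist in
Ignite, and the real algorithm also tracks dependents to avoid forwarding
the same probe twice:

import java.util.*;

/** probe(initiator, from, to) as in the Chandy-Misra-Haas AND model. */
class Probe {
    final UUID initiator; // tx that started the detection
    final UUID from;      // tx that forwarded the probe
    final UUID to;        // tx the probe is sent to

    Probe(UUID initiator, UUID from, UUID to) {
        this.initiator = initiator;
        this.from = from;
        this.to = to;
    }
}

class EdgeChasingSite {
    /** Local wait-for edges: blocked tx -> txes holding the locks it needs. */
    final Map<UUID, Set<UUID>> localWaitsFor = new HashMap<>();

    /** Invoked on the node responsible for tx probe.to. */
    void onProbe(Probe probe) {
        if (probe.to.equals(probe.initiator)) {
            abortVictim(probe.initiator); // probe came back: deadlock
            return;
        }
        // Forward the probe along every outgoing wait-for edge of the target.
        for (UUID next : localWaitsFor.getOrDefault(probe.to, Collections.emptySet()))
            send(new Probe(probe.initiator, probe.to, next));
    }

    void abortVictim(UUID xid) { /* roll back the deadlocked tx */ }
    void send(Probe probe)     { /* route to the node owning probe.to */ }
}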

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Vladimir,

Thanks for the articles! I studied them and a couple of others, and I would
like to share the knowledge I found.

BACKGROUND
First of all, our algorithm implemented in
org.apache.ignite.internal.processors.cache.transactions.TxDeadlockDetection
is not an edge-chasing algorithm. In essence, a lock-waiting transaction's
site polls the nodes responsible for the keys of interest one by one and
reconstructs the global wait-for graph (WFG) locally.
The centralized algorithm discussed in this thread looks similar to the
algorithms proposed by Ho [1]. The simpler of them (two-phase) reports false
deadlocks when unlocking before transaction finish is permitted, so it seems
that it only works when strict two-phase locking (2PL) is obeyed. The other
one (called one-phase) requires tracking all locked keys for each
transaction, which is not desirable for MVCC transactions.
The aforementioned edge-chasing algorithm by Chandy, Misra and Haas [2] is
proven to work even when 2PL is not obeyed.
Also, the performance target is not clear to me. It looks like centralized
approaches can use fewer messages and provide lower latency than distributed
ones, but I would like to understand the orders of magnitude of the latency
and the messaging overhead. Many of the algorithms are described in [3], and
some performance details are mentioned there. It is said: "As per [GRAY81],
most deadlocks involve two to four transactions." I see one more argument
making deadlock detection latency not so important: we can consider deadlock
an unavoidable situation, but even a lightning-fast detector will not save a
database from performance degradation in the face of tons of deadlocks.
What is also interesting, Google Cloud Spanner employs simple "wound-wait"
deadlock prevention [4] instead of any detection algorithm.

DISCUSSION
So, I see the following options:
1. An edge-chasing algorithm like Chandy, Misra and Haas [2].
2. The centralized two-phase algorithm described by Ho [1], with the
restriction that transactions must obey 2PL.
Personally, I lean toward the edge-chasing approach because it looks simple
and universal. Restricting ourselves to 2PL would restrict the features
which could be implemented in the future (e.g. transaction savepoints),
would it not?

WDYT?

Ivan

[1] https://turing.cs.hbg.psu.edu/comp512.papers/HoRamamoorthy-82.pdf
[2] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
[3] https://www.ics.uci.edu/~cs237/reading/p37-elmagarmid.pdf
[4] https://cloud.google.com/spanner/docs/transactions#rw_transaction_performance

--
Best regards,
Ivan Pavlukhin

Re: Suggestion to improve deadlock detection

Vladimir Ozerov
Hi Ivan,

Great analysis. I agree that edge-chasing looks like the better candidate.
First, it will be applicable to both normal and MVCC transactions. Second,
in MVCC we will probably also need to release some locks when doing
rollbacks. What we should think about is failover: what if a node which was
in the middle of a graph fails? We need to craft fault tolerance carefully.

Another critically important point is how to trigger deadlock detection. My
concern about edge-chasing was not about latency, but about the number of
messages which travel between nodes. A good algorithm must produce little to
no messaging under normal contention while still providing protection
against real deadlocks. So how would we trigger graph traversal for a given
transaction? Maybe we can start with a hard timeout and then employ a kind
of adaptive increment if a high rate of false positives is observed (a
sketch follows below). Maybe something else.
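
For illustration only, such a trigger could look like this; the constants
and names here are assumptions, not proposed defaults:

/** Assumed sketch: start chasing after a hard timeout; back off when probes
 *  keep finding nothing (false positives); become eager again on real hits. */
class AdaptiveDetectionTrigger {
    private static final long MIN_TIMEOUT_MS = 500;
    private static final long MAX_TIMEOUT_MS = 10_000;

    private volatile long timeoutMs = MIN_TIMEOUT_MS;

    /** Should this lock waiter start a graph traversal? */
    boolean shouldStartDetection(long waitedMs) {
        return waitedMs >= timeoutMs;
    }

    /** A detection round finished without finding a cycle. */
    void onFalsePositive() {
        timeoutMs = Math.min(MAX_TIMEOUT_MS, timeoutMs + timeoutMs / 2);
    }

    /** A real deadlock was found; detect more eagerly again. */
    void onDeadlockFound() {
        timeoutMs = Math.max(MIN_TIMEOUT_MS, timeoutMs / 2);
    }
}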

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Hi Vladimir,

Regarding fault tolerance: it seems that it is not a problem for
edge-chasing approaches. A found deadlock is identified by a message
returning to the detection initiator carrying the initiator's identifier. If
there is no deadlock, then such a message will not come. If some node
containing a deadlocked transaction fails, then its failure will break the
deadlock. Am I missing something?

About messaging overhead: indeed, it looks like edge-chasing approaches can
bring redundant messages. Perhaps we can borrow some optimization ideas from
[1]. I am also thinking about a timeout before starting deadlock detection.

Thoughts about adaptive timeouts lead me to some thoughts about monitoring.
I assume that long waits for locks signal suboptimal performance of the
system. I think it would be great to have means of monitoring transaction
contention. It could help users improve their workloads, and it could also
help us build some reasoning about how contention correlates with deadlock
detection overhead.

[1] http://mentallandscape.com/Papers_podc84.pdf

--
Best regards,
Ivan Pavlukhin

Re: Suggestion to improve deadlock detection

Vladimir Ozerov
Ivan,

The problem is that in our system a transaction may wait for N locks
simultaneously. This may form complex graphs which spread across many nodes.
Now consider that I have a deadlock between 4 nodes: A -> B -> *C* -> D ->
A. I've sent a message from A, and it never reached D because C failed.
Well, maybe that is fine, because some transactions will be rolled back. But
when they are rolled back, other cycles from pending multi-way deadlocks may
form, e.g. A -> B -> *E* -> D -> A. So the question is whether we will be
able to detect them reliably. I think we may use the following assumptions:
1) If a data node fails, the relevant transactions will be rolled back.
2) This means that some other transactions will make at least partial
progress with lock acquisition.

So, we can introduce a kind of counter which is advanced on every acquired
lock on a given node, and define a rule that a deadlock check for the given
pair (NODE_ID, COUNTER) should be requested at most once (a sketch follows
below). This way, even if a specific deadlock check request is lost due to
node failure and another deadlock has formed, some other node will
re-trigger the deadlock check sooner or later, as its counter advances.
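
A rough sketch of that rule; all names are made up, just to illustrate the
at-most-once semantics:

import java.util.*;
import java.util.concurrent.atomic.AtomicLong;

/** Assumed sketch of the (NODE_ID, COUNTER) at-most-once rule. */
class DeadlockCheckDeduplicator {
    /** Advanced on every lock acquired on the local node. */
    private final AtomicLong localLockCounter = new AtomicLong();

    /** Highest counter value already checked, per node. */
    private final Map<UUID, Long> lastChecked = new HashMap<>();

    long onLockAcquired() {
        return localLockCounter.incrementAndGet();
    }

    /** A check for a given (nodeId, counter) pair is started at most once. */
    synchronized boolean tryStartCheck(UUID nodeId, long counter) {
        long prev = lastChecked.getOrDefault(nodeId, -1L);
        if (counter <= prev)
            return false; // this pair (or a newer one) was already checked
        lastChecked.put(nodeId, counter);
        return true;
    }
}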

Makes sense?

> > > > > * Failover (distributed).
> > > > > And also having a park of deadlock detectors sounds not very good.
> I
> > > > > hope that it is possible to develop a common solution suitable for
> > > > > both kinds of transactions. I suggest to pilot new solution with
> MVCC
> > > > > and then adopt it for classic transactions.
> > > > >
> > > > > PROPOSAL
> > > > > Actually I propose to start with an centralized algorithm
> described by
> > > > > Vladimir in the beginning of the thread. I will try to outline main
> > > > > points of it.
> > > > > 1. Single deadlock detector exists in the cluster which maintains
> > > > > transaction wait-for graph (WFG).
> > > > > 2. Each cluster node sends and invalidates wait-for edges to the
> > > detector.
> > > > > 3. The detector periodically searches cycles in WFG and chooses and
> > > > > aborts a victim transaction if cycle is found.
> > > > >
> > > > > Currently I have one fundamental question. Is there a possibility
> of
> > > > > false detected deadlocks because of concurrent WFG updates?
> > > > > Of course there are many points of improvements and optimizations.
> But
> > > > > I would like to start from discussing key points.
> > > > >
> > > > > Please share your thoughts!
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > > > > ср, 14 нояб. 2018 г. в 15:47, ipavlukhin <[hidden email]>:
> > > > > >
> > > > > > Hi Igniters,
> > > > > >
> > > > > > I would like to resume the discussion about a deadlock detector.
> I
> > > start
> > > > > > with a motivation for a further work on a subject. As I see
> current
> > > > > > implementation (entry point IgniteTxManager.detectDeadlock)
> starts a
> > > > > > detection only after a transaction was timed out. In my mind it
> is
> > > not
> > > > > > very good from a product usability standpoint. As you know, in a
> > > > > > situation of deadlock some keys become non-usable for an infinite
> > > amount
> > > > > > of time. Currently the only way to work around it is configuring
> a
> > > > > > timeout, but it could be rather tricky in practice to choose a
> > > > > > proper/universal value for it. So, I see the main point as:
> > > > > >
> > > > > > Ability to break deadlocks without a need to configure timeouts
> > > > > explicitly.
> > > > > >
> > > > > > I will return soon with some thoughts about implementation.
> > > Meanwhile,
> > > > > > does anybody have in mind any other usability points which I am
> > > missing?
> > > > > > Or is there any alternative approaches?
> > > > > >
> > > > > > On 2017/11/21 08:32:02, Dmitriy Setrakyan <[hidden email]>
> wrote:
> > > > > >  > On Mon, Nov 20, 2017 at 10:15 PM, Vladimir Ozerov <
> > > [hidden email]
> > > > > >>
> > > > > >  > wrote:>
> > > > > >  >
> > > > > >  > > It doesn’t need all txes. Instead, other nodes will send
> info
> > > about>
> > > > > >  > > suspicious txes to it from time to time.>
> > > > > >  > >>
> > > > > >  >
> > > > > >  > I see your point, I think it might work.>
> > > > > >  >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ivan Pavlukhin
> > > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivan Pavlukhin
> > >
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Vladimir,

I think it might work. So, if nobody minds I can start prototyping
the edge-chasing approach.

--
Best regards,
Ivan Pavlukhin
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Hi folks,

During implementation of the edge-chasing deadlock detection algorithm in
the scope of [1] it has been realized that we basically have 2 options for
the "chasing" strategy. I will use the term Near when GridNearTxLocal is
assumed and Dht when GridDhtTxLocal (a tx which updates a primary
partition) is assumed. So, here are the 2 strategies:
1. Send an initial probe when a Dht tx blocks waiting for another tx to
release a key. The initial probe is sent from the waiting Dht tx to the
Near tx holding the lock. If the receiving Near tx is blocked as well, it
relays the probe to the Dht tx it awaits a response from. So, the probe
travels like Dht -> Near -> Dht -> Near -> ... Let's call this approach
"forward-only" (a minimal sketch of it follows right after this list).
2. Send probes only between Near transactions. This approach requires an
additional request-response call which a Near tx issues to a Dht node to
check whether the tx sending the request is waiting for another tx. The
call returns the identifier of the Near tx blocking the sender (if the
sender is in fact blocked). Then the waiting Near tx relays a probe to
that blocking Near tx. Let's call that approach "lock-checking".
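
To make the relay concrete, below is a minimal sketch of a "forward-only"
probe handler. It treats each tx as a single logical node (glossing over
the Near/Dht split), and every name in it is hypothetical - a sketch of
the idea, not the actual Ignite API:

import java.util.UUID;

// Hypothetical sketch of the "forward-only" relay; not the actual Ignite API.
final class DeadlockProbe {
    final UUID initiatorTx; // tx that originated the deadlock check

    DeadlockProbe(UUID initiatorTx) {
        this.initiatorTx = initiatorTx;
    }
}

interface Tx {
    UUID id();
    boolean isWaiting();
    UUID blockerNode(); // node hosting the tx this tx is blocked on
    void rollbackAsDeadlockVictim();
}

interface Messaging {
    void send(UUID nodeId, DeadlockProbe probe);
}

final class ProbeHandler {
    /** Invoked on the node of the tx that currently holds the contended lock. */
    void onProbe(DeadlockProbe probe, Tx lockHolder, Messaging msg) {
        if (probe.initiatorTx.equals(lockHolder.id()))
            // The probe traveled the whole wait-for cycle back to its
            // initiator - a deadlock exists and the initiator is the victim.
            lockHolder.rollbackAsDeadlockVictim();
        else if (lockHolder.isWaiting())
            // The holder is blocked itself: relay the probe one wait-for
            // edge further (Dht -> Near -> Dht -> Near -> ... in practice).
            msg.send(lockHolder.blockerNode(), probe);
        // Otherwise the holder is running and the probe simply dies here:
        // nobody waits for a reply to it, so no failover logic is needed.
    }
}

The important property is the last branch: a probe that reaches a running
tx (or is lost together with a failed node) just disappears.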

I would like to define several points to consider while comparing
approaches (please point out more if I miss something):
1. Correctness. Especially regarding reporting false deadlocks.
2. Extensibility. Rollback to savepoint and generalization for classic
transactions should be considered.
3. Messaging overhead.
4. Simplicity of implementation.

You can find an implementation of the "lock-checking" approach in the PR
attached to the ticket [1]. Currently it works correctly only for
transactions obeying strict two-phase locking (2PL). It is fairly easy
to adapt the PR to implement the "forward-only" approach, which will also
work correctly under 2PL. The "lock-checking" algorithm uses 1.5 times
more messages than "forward-only". Implementation of the "forward-only"
algorithm is simpler in the sense that it does not require introducing
lock-wait-check messages and a future. But to me that does not look like
a big deal. I think that the implementation efforts for both approaches
are almost equal so far.

But the cornerstone here is extensibility. Let's imagine that we are
going to implement rollback to savepoint. One suggested way to extend the
"forward-only" approach for that case is invalidating sent probes.
Basically, a tx that initiated a deadlock check assigns a unique id to
its probe, and this probe id is invalidated when the tx wakes up from the
wait. If that tx receives back a probe with an invalidated id, it simply
discards it. If the id is not invalidated, it means a deadlock. But a
false deadlock can still be detected. I will provide an example where it
does not work correctly.
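
For illustration, the invalidation bookkeeping on the initiator side
could look like the following sketch (hypothetical names again, just the
scheme described above):

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: the initiator stamps each probe with a fresh id and bumps a
// watermark when it wakes up from the wait, so any probe issued before
// the wake-up is discarded when it returns. Hypothetical names.
final class InvalidatableProbe {
    final UUID initiatorTx;
    final long probeId; // unique per deadlock check started by the initiator

    InvalidatableProbe(UUID initiatorTx, long probeId) {
        this.initiatorTx = initiatorTx;
        this.probeId = probeId;
    }
}

final class ProbeIds {
    private final AtomicLong lastIssued = new AtomicLong();
    private volatile long minValid; // probe ids below this are invalidated

    /** Called when the tx starts a new deadlock check. */
    long issue() {
        return lastIssued.incrementAndGet();
    }

    /** Called when the tx acquires the awaited lock and stops waiting. */
    void onWakeUp() {
        minValid = lastIssued.get() + 1;
    }

    /** Called when an own probe returns; false means it is stale, discard. */
    boolean isStillValid(long probeId) {
        return probeId >= minValid;
    }
}

A deadlock would be reported only if the returned probe id passes
isStillValid(); the scenario below shows why even this is not enough.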

Please note that our transactions can request multiple locks
simultaneously (the so-called AND model). Also, rollback to savepoint
releases the locks obtained after creating the savepoint. Let's use the
following notation: "*" means savepoint, "^" means rollback to the
previous savepoint, "!" means acquiring a lock, "?" means waiting for
a lock, "&" means requesting multiple locks simultaneously. See the
following transaction execution scenario for 3 transactions and 3
keys:
TA      | TB | TC
        |    |  *
        | 1! | 2!
1? & 3! |    |
        | 2? |
        |    |  ^
        |    | 3!

We can notice that the suggested scenario does not introduce a deadlock
because locks are requested in a consistent order by all transactions.
But at different moments there exist the waiting relations
TA w TB, TB w TC, TC w TA. So, we suspect a false deadlock could be
detected here. Also, one can notice that the relation TC w TA is
established only after TB w TC is destroyed. The messages on a timeline
diagram demonstrate how the "forward-only with probe invalidation"
approach can detect a false deadlock here [2]. On the picture L means
"lock", U "unlock", W "wait". A red cross means reporting a false
deadlock. The intuition here is that while a probe is in flight one
wait-for edge can be destroyed (TB w TC) and another one can be created
(TC w TA). So, the approach can report false deadlocks. Also, note that a
false deadlock can be found even when the locking order is consistent!
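
To double-check the claim that this scenario is deadlock-free at every
single instant, here is a small self-contained sketch (simplified
single-owner locks, all names are mine) that replays the events and
asserts after each step that the wait-for graph has no cycle:

import java.util.*;

// Replays the scenario above with simplified single-owner locks and checks
// after every event that the wait-for graph is acyclic. The three wait-for
// relations do appear, but never at the same time.
public class ScenarioReplay {
    static final Map<Integer, String> owner = new HashMap<>();    // key -> tx holding it
    static final Map<String, Integer> waitsFor = new HashMap<>(); // tx -> key it is blocked on

    public static void main(String[] args) {
        lock("TB", 1);   // TB: 1!
        lock("TC", 2);   // TC: * then 2!
        lock("TA", 3);   // TA: 1? & 3! - key 3 is granted...
        lock("TA", 1);   // ...and TA blocks on TB for key 1 (TA w TB)
        lock("TB", 2);   // TB: 2? - TB blocks on TC (TB w TC)
        unlock("TC", 2); // TC: ^ - rollback to savepoint releases key 2
        grant(2);        // TB wakes up and acquires key 2 (TB w TC is gone)
        lock("TC", 3);   // TC: 3! - TC blocks on TA (TC w TA)

        System.out.println("No deadlock existed at any step.");
    }

    static void lock(String tx, int key) {
        String holder = owner.get(key);
        if (holder == null)
            owner.put(key, tx);
        else
            waitsFor.put(tx, key);
        assertNoCycle();
    }

    static void unlock(String tx, int key) {
        owner.remove(key);
        assertNoCycle();
    }

    static void grant(int key) {
        for (Iterator<Map.Entry<String, Integer>> it = waitsFor.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry<String, Integer> e = it.next();
            if (e.getValue() == key) {
                owner.put(key, e.getKey());
                it.remove();
                break;
            }
        }
        assertNoCycle();
    }

    // Follows tx -> key -> owner chains; a repeated tx would mean a cycle.
    static void assertNoCycle() {
        for (String start : waitsFor.keySet()) {
            Set<String> seen = new HashSet<>();
            String tx = start;
            while (tx != null && seen.add(tx)) {
                Integer key = waitsFor.get(tx);
                tx = key == null ? null : owner.get(key);
            }
            if (tx != null)
                throw new AssertionError("Wait-for cycle at tx " + tx);
        }
    }
}

An edge-chasing probe, however, observes these edges at different
moments, which is exactly how the false positive on the diagram arises.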

After a while I will describe how "lock-checking" can be adapted to
support rollback to savepoint. The disclaimer here is that it will
require even more messages.
----

I think that we can already start a discussion.
Actually, while false deadlocks can be surprising, we perhaps can
tolerate them because our transaction implementation assumes that
transactions can be rolled back by the system. The "forward-only"
approach requires fewer messages, but in big-O terms both approaches have
the same complexity. Unfortunately, both the correctness and the
complexity differences have a hardly predictable impact on real
workloads. In an ideal world thorough testing could give the answers.

Please share your vision.

[1] https://issues.apache.org/jira/browse/IGNITE-9322
[2] https://gist.githubusercontent.com/pavlukhin/c8c7c6266eeab56048c31f5cdfb31d20/raw/7d1aef9abcd014ec9fdf69840168768ced638b6c/msg_diagram.jpg

--
Best regards,
Ivan Pavlukhin
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion to improve deadlock detection

gvvinblade
Ivan,

I would prefer the forward-only implementation even knowing it allows
false-positive check results.

Why I think so:

1) From my experience, whenever we have a future waiting for a reply, we
have to take failover into consideration.
Usually failover implementations are way more complex than the initial
algorithm itself.
The forward-only version doesn't need any failover implementation as it
expects failed probes (probes that simply didn't face a deadlock).

2) A simple, lightweight feature implementation is a really good starting
point - it may be extended if needed, but quite often such an extension
turns out not to be needed at all.

3) Any implementation allows false-positive results - we are working with
a distributed system where a bunch of processes happen at the same time;
for example, a concurrent rollback on timeout may resolve a deadlock
while a probe is already finished, so a rollback on deadlock happens as a
result anyway.

4) The described case (when a false-positive result is probable) isn't
just unusual but potentially extremely rare; I don't think we should
solve it, since it doesn't break the grid.

Regards,
Igor

вт, 18 дек. 2018 г. в 17:57, Павлухин Иван <[hidden email]>:

> Hi folks,
>
> During implementation of edge-chasing deadlock detection algorithm in
> scope of [1] it has been realized that basically we have 2 options for
> "chasing" strategy. I will use terms Near when GridNearTxLocal is
> assumed and Dht when GridDhtTxLocal (tx which updates primary
> partition). So, here is 2 mentioned strategies:
> 1. Send initial probe when Dht tx blocks waiting for another tx to
> release a key. Initial probe is sent from waiting Dht tx to Near tx
> holding a lock. If receiving Near tx is blocked as well then it relays
> the probe to Dht tx it awaits response from. So, the probe is
> traveling like Dht -> Near -> Dht -> Near -> ... Let's call the
> approach "forward-only".
> 2. Send probes only between Near transactions. This approach requires
> additional request-response call which Near tx issues to Dht node to
> check if tx sending a request is waiting for another tx. The call
> returns identifier of a Near tx blocking tx sending a request (if
> sending tx is in fact blocked). Then waiting Near tx relays a probe to
> blocked Near tx. Let's call that approach "lock-checking".
>
> I would like to define several points to consider while comparing
> approaches (please point out more if I miss something):
> 1. Correctness. Especially regarding reporting false deadlocks.
> 2. Extensibility. Rollback to savepoint and generalization for classic
> transactions should be considered.
> 3. Messaging overhead.
> 4. Simplicity of implementation.
>
> You can find an implementation of "lock-checking" approach in PR
> attached to the ticket [1]. Currently it works correctly only for
> transactions obeying strict two-phase locking (2PL). It is fairly easy
> to adopt PR to implement "forward-only" approach. Which will work
> correct in conditions of 2PL. "lock-checking" algorithm uses 1.5 times
> more messages than "forward-only". Implementation of "forward-only"
> algorithm is simpler in a sense that it does not require introducing
> lock-wait-check messages and future. But as for me it does not look as
> big deal. I think that implementation efforts for both approaches are
> almost equal so far.
>
> But corner stone here is an extensibility. Let's imagine that we are
> going to implement rollback to savepoint. One suggested approach to
> extend "forward-only" approach for that case is invalidating sent
> probe. Basically, a tx initiated deadlock check assigns unique id to a
> probe. And this probe id is invalidated when the tx wakes up from
> wait. If that tx receives back a probe with invalidated id it simply
> discards it. If id is not invalidated it means a deadlock. But false
> deadlock still can be detected. I will provide an example when it does
> not work correctly.
>
> Please note, that our transactions can request multiple locks
> simultaneously (so-called AND model). Also rollback to savepoint
> releases locks obtained after creating a savepoint. Let's use
> following notation. "*" means savepoint. "^" means rollback to
> previous savepoint. "!" means acquiring a lock. "?" means waiting for
> a lock. "&" means requesting multiple locks simultaneously. See the
> following transaction execution scenario for 3 transactions and 3
> keys:
> TA        | TB     |   TC
>             |          |   *
>             |  1!      |  2!
> 1? & 3! |           |
>             |  2?     |
>             |          |  ^
>             |          |  3!
>
> We can notice that suggested scenario does not introduce deadlock
> because locks are requested in consistent order among all
> transactions. But in different moments there exist waiting relations
> TA w TB, TB w TC, TC w TA. So, we suspect a false deadlock could be
> detected here. Also one can notice that a relation TC w TA is
> established after TB w TC is destroyed. Messages on a timeline diagram
> demonstrates how "forward-only with probe invalidation" approach can
> detect a false deadlock here [2]. On the picture L means "lock", U
> "unlock", W "wait". Red cross means reporting a false deadlock.
> Intuition here is that while probe is in flight one wait-for edge can
> be destroyed (TB w TC) and another one can be created (TC w TA). So,
> the approach can report false deadlocks. Also, note that a false
> deadlock can be found even when locking order is consistent!
>
> After a while I will describe how "lock-checking" can be adopted to
> support rollback to savepoint. Disclaimer here is that it will require
> even more messages.
> ----
>
> I think that we can already start a discussion.
> Actually, while false deadlocks can be surprising we perhaps can
> tolerate it because our transaction implementation assumes that
> transactions can be rolled back by system. "forward-only" approach
> requires lesser messages but in big-O terms both approaches have the
> same complexity. Both correctness and complexity impact unfortunately
> have hardly predictable impact for real workflows. In ideal world
> thorough testing could give the answers.
>
> Please, share your vision.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9322
> [2]
> https://gist.githubusercontent.com/pavlukhin/c8c7c6266eeab56048c31f5cdfb31d20/raw/7d1aef9abcd014ec9fdf69840168768ced638b6c/msg_diagram.jpg
> пн, 26 нояб. 2018 г. в 15:27, Павлухин Иван <[hidden email]>:
> >
> > Vladimir,
> >
> > I think it might work. So, if nobody minds I can start prototyping
> > edge-chasing approach.
> > пн, 26 нояб. 2018 г. в 14:32, Vladimir Ozerov <[hidden email]>:
> > >
> > > Ivan,
> > >
> > > The problem is that in our system a transaction may wait for N locks
> > > simultaneously. This may form complex graphs which spread between many
> > > nodes. Now consider that I have a deadlock between 4 nodes: A -> B ->
> *C*
> > > -> D -> A. I've sent a message from a and never reached D because C
> failed.
> > > Well, may be that is good, because some transactions will be rolled
> back.
> > > But when they are rolled back, another cycles from pending multi-way
> > > deadlocks may form again. E.g. A -> B -> *E* -> D -> A. So the
> question is
> > > whether we will be able to detect them reliably. I think that we may
> use
> > > the following assumption:
> > > 1) If data node fails, relevant transactions will be rolled-back
> > > 2) It means that some other transactions will make at least partial
> > > progress with locks acquisition
> > >
> > > So, we can introduce a kind of counter which will be advanced on every
> > > acquired lock on a given node. And define a rule that transaction
> deadlock
> > > request for the given pair (NODE_ID, COUNTER) should be requested at
> most
> > > once. This way, even if specific deadlock check request is lost due to
> node
> > > failure, and another deadlock has formed, then some other node will
> > > re-trigger deadlock change request sooner or later, as it's counter
> > > advanced.
> > >
> > > Makes sense?
> > >
> > > On Sat, Nov 24, 2018 at 5:40 PM Павлухин Иван <[hidden email]>
> wrote:
> > >
> > > > Hi Vladimir,
> > > >
> > > > Regarding fault tolerance. It seems that it is not a problem for
> > > > edge-chasing approaches. A found deadlock is identified by a message
> > > > returned to a detection initiator with initiator's identifier. If
> > > > there is no deadlock then such message will not come. If some node
> > > > containing a deadlocked transaction fails then it will break the
> > > > deadlock. Am I missing something?
> > > >
> > > > About messaging overhead. Indeed, it looks like edge-chasing
> > > > approaches can bring redundant messages. Perhaps, we can borrow some
> > > > ideas about optimization from [1]. And I also think about a timeout
> > > > before starting a deadlock detection.
> > > >
> > > > Thoughts about adaptive timeouts lead me to some thoughts about
> > > > monitoring. I assume that long waiting for locks signals about not
> > > > optimal performance of the system. I think it would be great to have
> > > > means of monitoring transactions contention. It could help users to
> > > > improve theirs workloads. It also could help us to build some
> > > > reasoning how contention correlates with deadlock detection overhead.
> > > >
> > > > [1] http://mentallandscape.com/Papers_podc84.pdf
> > > > вс, 18 нояб. 2018 г. в 10:52, Vladimir Ozerov <[hidden email]
> >:
> > > > >
> > > > > Hi Ivan,
> > > > >
> > > > > Great analysis. Agree that edge-chasing looks like better
> candidate.
> > > > First,
> > > > > it will be applicable to both normal and MVCC transactions.
> Second, in
> > > > MVCC
> > > > > we probably will also need to release some locks when doing
> rollbacks.
> > > > What
> > > > > we should think about is failover - what if a node which was in the
> > > > middle
> > > > > of a graph fails? We need to craft fault-tolerance carefully.
> > > > >
> > > > > Another critically important point is how to trigger deadlock
> detection.
> > > > My
> > > > > concerns about edge-chasing was not about latency, but about a
> number of
> > > > > messages which travels between nodes. Good algorithm must produce
> little
> > > > to
> > > > > no messages in case of normal contenion while still providing
> protection
> > > > in
> > > > > case of real deadlocks. So how would we trigger fraph traversal
> for the
> > > > > given transaction? May be we can start with hard timeout and then
> employ
> > > > a
> > > > > kind of adaptive increment in case high rate of false-positives are
> > > > > observed. May be something else.
> > > > >
> > > > > On Sun, Nov 18, 2018 at 10:21 AM Павлухин Иван <
> [hidden email]>
> > > > wrote:
> > > > >
> > > > > > Vladimir,
> > > > > >
> > > > > > Thanks for the articles! I studied them and a couple of others.
> And I
> > > > > > would like to share a knowledge I found.
> > > > > >
> > > > > > BACKGROUND
> > > > > > First of all our algorithm implemented in
> > > > > >
> > > > > >
> > > >
> org.apache.ignite.internal.processors.cache.transactions.TxDeadlockDetection
> > > > > > is not an edge-chasing algorithm. In essence a lock-waiting
> > > > > > transaction site polls nodes responsible for keys of interest
> one by
> > > > > > one and reconstructs global wait-for graph (WFG) locally.
> > > > > > A centralized algorithm discussed in this thread looks similar to
> > > > > > algorithms proposed by Ho [1]. The simplest of them (two-phased)
> > > > > > reports false deadlocks when unlock before transaction finish is
> > > > > > permitted. So, it seems that it only works when strict two-phase
> > > > > > locking (2PL) is obeyed. Another one (called one-phased) requires
> > > > > > tracking all locked keys by each transaction which is not
> desirable
> > > > > > for MVCC transactions.
> > > > > > Aforementioned edge-chasing algorithm by Chandy, Misra and Haas
> [2] is
> > > > > > proven to work even when 2PL is not obeyed.
> > > > > > Also performance target is not clear for me. It looks like
> centralized
> > > > > > approaches can use fewer messages and provide lower latency that
> > > > > > distributed ones. But would like to understand what are the
> orders of
> > > > > > latency and messaging overhead. Many of algorithms are described
> in
> > > > > > [3] and some performance details are mentioned. It is said "As
> per
> > > > > > [GRAY81], most deadlocks involve two to four transactions." I
> see one
> > > > > > more argument making deadlock detection algorithm latency not so
> > > > > > important. We can consider deadlock as unavoidable situation,
> but even
> > > > > > lightning fast detector will not save a database from performance
> > > > > > degradation in case of tons of deadlocks.
> > > > > > What is also interesting, Google Cloud Spanner employs simple
> > > > > > "wound-wait" deadlock prevention [4] instead of any detection
> > > > > > algorithm.
> > > > > >
> > > > > > DISCUSSION
> > > > > > So, I see the following options:
> > > > > > 1. Edge-chasing algorithm like Chandy, Misra and Haas [2].
> > > > > > 2. Centralized two-phased algorithm described by Ho [1] with
> > > > > > restriction that transactions must obey 2PL.
> > > > > > Personally, I tend to edge-chasing approach because it looks
> simple
> > > > > > and universal. Restricting ourselves with 2PL will restrict us in
> > > > > > features which could be implemented in future (e.g. transaction
> > > > > > savepoints), will not it?
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > Ivan
> > > > > >
> > > > > > [1]
> https://turing.cs.hbg.psu.edu/comp512.papers/HoRamamoorthy-82.pdf
> > > > > > [2]
> > > > > >
> > > >
> https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > [3] https://www.ics.uci.edu/~cs237/reading/p37-elmagarmid.pdf
> > > > > > [4]
> > > > > >
> > > >
> https://cloud.google.com/spanner/docs/transactions#rw_transaction_performance
> > > > > >
> > > > > > ср, 14 нояб. 2018 г. в 22:13, Vladimir Ozerov <
> [hidden email]>:
> > > > > > >
> > > > > > > Ivan,
> > > > > > >
> > > > > > > This is interesting question. I think we should spend some
> time for
> > > > > > formal
> > > > > > > verification whether this algorithm works or not. Several
> articles
> > > > you
> > > > > > may
> > > > > > > use as a startitng point: [1], [2]. From what I understand,
> Ignite
> > > > fall
> > > > > > > into "AND" model, and currently implemented algorithm is a
> variation
> > > > of
> > > > > > > "edge-chasing" approach as per Chandy, Misra and Haas [3],
> which is
> > > > > > *proven
> > > > > > > to be correct* in that it both detects deadlocks when they are
> > > > present,
> > > > > > and
> > > > > > > do not produce false positives. But is might be too heavy for
> the
> > > > system
> > > > > > > under contention.
> > > > > > >
> > > > > > > We need to search for any formal proof of correctness of
> proposed
> > > > > > > algorithm. This area is already researched throughly enough,
> so we
> > > > should
> > > > > > > be able to find an answer quickly.
> > > > > > >
> > > > > > > Vladimir.
> > > > > > >
> > > > > > > [1] http://www.cse.scu.edu/~jholliday/dd_9_16.htm
> > > > > > > [2] https://www.cs.uic.edu/~ajayk/Chapter10.pdf
> > > > > > > [3]
> > > > > > >
> > > > > >
> > > >
> https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > >
> > > > > > > On Wed, Nov 14, 2018 at 6:55 PM Павлухин Иван <
> [hidden email]>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Next part as promised. A working item for me is a deadlock
> detector
> > > > > > > > for MVCC transactions [1]. The message is structured in 2
> parts.
> > > > First
> > > > > > > > is an analysis of the current state of affairs and possible
> > > > options to
> > > > > > > > go. Second is a proposed option. First part is going to be
> not so
> > > > > > > > short so some might prefer to skip it.
> > > > > > > >
> > > > > > > > ANALYSIS
> > > > > > > > The immediate question is "why we cannot use an existing
> deadlock
> > > > > > > > detector?". The differences between classic and MVCC
> transactions
> > > > > > > > implementation is the answer. Currently a collection of
> > > > IgniteTxEntry
> > > > > > > > is used for detection. But such collection is not maintained
> for
> > > > MVCC
> > > > > > > > transactions. So, it will not work out of box.
> > > > > > > > Also it looks like that current distributed iterative
> approach
> > > > cannot
> > > > > > > > be low latency it the worst case because of doing possibly
> many
> > > > > > > > network requests sequentially.
> > > > > > > > So, what options do we have? Generally we should choose
> > > > > > > > between centralized and distributed approaches. By a
> > > > > > > > centralized approach I mean the existence of a dedicated
> > > > > > > > deadlock detector located on a single node.
> > > > > > > > In the centralized approach we can face difficulties related
> > > > > > > > to failover, as the node running the deadlock detector can
> > > > > > > > fail. In the distributed approach extra network messaging
> > > > > > > > overhead can strike because different nodes participating in a
> > > > > > > > deadlock can start detection independently and send redundant
> > > > > > > > messages. I see some aspects which matter for choosing an
> > > > > > > > implementation. Here they are, with the approach that is
> > > > > > > > better (roughly speaking) in parentheses:
> > > > > > > > * Detection latency (centralized).
> > > > > > > > * Messaging overhead (centralized).
> > > > > > > > * Failover (distributed).
> > > > > > > > Also, having a park of deadlock detectors sounds not very
> > > > > > > > good. I hope that it is possible to develop a common solution
> > > > > > > > suitable for both kinds of transactions. I suggest piloting
> > > > > > > > the new solution with MVCC and then adapting it for classic
> > > > > > > > transactions.
> > > > > > > >
> > > > > > > > PROPOSAL
> > > > > > > > Actually I propose to start with the centralized algorithm
> > > > > > > > described by Vladimir at the beginning of the thread. I will
> > > > > > > > try to outline its main points (a sketch of the cycle search
> > > > > > > > follows below):
> > > > > > > > 1. A single deadlock detector exists in the cluster and
> > > > > > > > maintains the transaction wait-for graph (WFG).
> > > > > > > > 2. Each cluster node sends and invalidates wait-for edges to
> > > > > > > > the detector.
> > > > > > > > 3. The detector periodically searches for cycles in the WFG
> > > > > > > > and chooses and aborts a victim transaction if a cycle is
> > > > > > > > found.
> > > > > > > >
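
A minimal sketch of the cycle search such a detector could run over the
WFG (point 3 of the quoted proposal). All names here are illustrative
assumptions, with UUID standing in for a transaction id; this is not
Ignite internals:

    import java.util.*;

    class WfgDetector {
        /** Wait-for edges: waiting tx -> txs it waits for (AND model). */
        private final Map<UUID, Set<UUID>> wfg = new HashMap<>();

        /** Returns some tx that is part of a wait-for cycle, or null. */
        UUID findDeadlockedTx() {
            Set<UUID> done = new HashSet<>(), onPath = new HashSet<>();
            for (UUID tx : wfg.keySet()) {
                UUID victim = dfs(tx, done, onPath);
                if (victim != null)
                    return victim;
            }
            return null;
        }

        private UUID dfs(UUID tx, Set<UUID> done, Set<UUID> onPath) {
            if (onPath.contains(tx))
                return tx;      // Back edge: wait-for cycle found.
            if (!done.add(tx))
                return null;    // Already fully explored, no cycle here.
            onPath.add(tx);
            for (UUID next : wfg.getOrDefault(tx, Collections.emptySet())) {
                UUID victim = dfs(next, done, onPath);
                if (victim != null)
                    return victim;
            }
            onPath.remove(tx);
            return null;
        }
    }

The tx returned by such a search would then be aborted as the victim.
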
> > > > > > > > Currently I have one fundamental question. Is there a
> > > > > > > > possibility of falsely detected deadlocks because of
> > > > > > > > concurrent WFG updates?
> > > > > > > > Of course there are many points of improvement and
> > > > > > > > optimization. But I would like to start by discussing the key
> > > > > > > > points.
> > > > > > > >
> > > > > > > > Please share your thoughts!
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > > > > > > > Wed, 14 Nov 2018 at 15:47, ipavlukhin <[hidden email]>:
> > > > > > > > >
> > > > > > > > > Hi Igniters,
> > > > > > > > >
> > > > > > > > > I would like to resume the discussion about a deadlock
> > > > > > > > > detector. I will start with a motivation for further work
> > > > > > > > > on the subject. As I see it, the current implementation
> > > > > > > > > (entry point IgniteTxManager.detectDeadlock) starts
> > > > > > > > > detection only after a transaction has timed out. To my
> > > > > > > > > mind this is not very good from a product usability
> > > > > > > > > standpoint. As you know, in a deadlock situation some keys
> > > > > > > > > become unusable for an infinite amount of time. Currently
> > > > > > > > > the only way to work around it is configuring a timeout,
> > > > > > > > > but in practice it can be rather tricky to choose a
> > > > > > > > > proper/universal value for it. So, I see the main point as:
> > > > > > > > >
> > > > > > > > > Ability to break deadlocks without a need to configure
> > > > > > > > > timeouts explicitly.
> > > > > > > > >
> > > > > > > > > I will return soon with some thoughts about implementation.
> > > > > > > > > Meanwhile, does anybody have in mind any other usability
> > > > > > > > > points which I am missing? Or are there any alternative
> > > > > > > > > approaches?
> > > > > > > > >
> > > > > > > > > On 2017/11/21 08:32:02, Dmitriy Setrakyan <[hidden email]> wrote:
> > > > > > > > > > On Mon, Nov 20, 2017 at 10:15 PM, Vladimir Ozerov <[hidden email]> wrote:
> > > > > > > > > > > It doesn’t need all txes. Instead, other nodes will
> > > > > > > > > > > send info about suspicious txes to it from time to time.
> > > > > > > > > >
> > > > > > > > > > I see your point, I think it might work.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Ivan Pavlukhin
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Ivan Pavlukhin
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Ivan Pavlukhin
> > > >
> >
> >
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Igor,

I see your points. And I agree to start with the "lightweight
implementation". Today we have 2PL and there is no activity on
implementing rollback to savepoint. And if we do implement it in the
future, we will have to return to the subject of deadlock detection
anyway.

I will proceed with "forward-only" approach.
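
A minimal sketch of the "forward-only" probe relay (the strategy is
described in the quoted message below); every class and method name
here is an illustrative assumption, not an existing Ignite API:

    import java.util.UUID;

    /** Illustrative probe traveling Dht -> Near -> Dht -> Near -> ... */
    class DeadlockProbe {
        final UUID initiatorTxId; // Tx that started the deadlock check.
        final UUID probeId;       // Unique id, invalidated if the tx wakes up.

        DeadlockProbe(UUID initiatorTxId, UUID probeId) {
            this.initiatorTxId = initiatorTxId;
            this.probeId = probeId;
        }
    }

    class ProbeHandler {
        /** Called on the tx that receives a relayed probe. */
        void onProbe(DeadlockProbe probe, TxState rcvTx) {
            if (rcvTx.txId().equals(probe.initiatorTxId)) {
                // Probe returned to its initiator: a wait-for cycle exists.
                rcvTx.rollbackOnDeadlock();
            }
            else if (rcvTx.isWaitingForLock()) {
                // Relay the probe to the tx holding the awaited lock.
                rcvTx.sendTo(rcvTx.blockerTxId(), probe);
            }
            // Otherwise the receiving tx is not blocked: drop the probe.
        }
    }

    /** Assumed view of a transaction's local state. */
    interface TxState {
        UUID txId();
        boolean isWaitingForLock();
        UUID blockerTxId();
        void sendTo(UUID txId, DeadlockProbe probe);
        void rollbackOnDeadlock();
    }

Note how a lost probe simply means no deadlock is reported, which is
why this variant needs no failover handling.
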
Tue, 18 Dec 2018 at 18:33, Seliverstov Igor <[hidden email]>:

>
> Ivan,
>
> I would prefer the forward-only implementation even knowing it allows
> false-positive check results.
>
> Why I think so:
>
> 1) From my experience, whenever we have a future waiting for a reply,
> we have to take failover into consideration.
> Usually failover implementations are way more complex than the initial
> algorithm itself.
> The forward-only version doesn't need any failover implementation, as
> it expects failed probes (probes that didn't find a deadlock).
>
> 2) A simple, lightweight feature implementation is a really good point
> to start from - it may be extended if needed, but quite often such an
> extension turns out not to be needed at all.
>
> 3) Any implementation allows false-positive results - we are working
> with a distributed system where a bunch of processes happen at the
> same time. For example, a concurrent rollback on timeout may resolve a
> deadlock while a probe is still in flight, so a rollback on deadlock
> also happens as a result.
>
> 4) The described case (when a false-positive result is probable) isn't
> just unusual but potentially extremely rare. I don't think we should
> solve it, since it doesn't break the grid.
>
> Regards,
> Igor
>
> Tue, 18 Dec 2018 at 17:57, Павлухин Иван <[hidden email]>:
>
> > Hi folks,
> >
> > During implementation of the edge-chasing deadlock detection algorithm
> > in the scope of [1] it has been realized that basically we have 2
> > options for the "chasing" strategy. I will use the term Near when
> > GridNearTxLocal is assumed and Dht for GridDhtTxLocal (the tx which
> > updates a primary partition). So, here are the 2 mentioned strategies:
> > 1. Send an initial probe when a Dht tx blocks waiting for another tx
> > to release a key. The initial probe is sent from the waiting Dht tx to
> > the Near tx holding the lock. If the receiving Near tx is blocked as
> > well, then it relays the probe to the Dht tx it awaits a response
> > from. So, the probe travels like Dht -> Near -> Dht -> Near -> ...
> > Let's call this approach "forward-only".
> > 2. Send probes only between Near transactions. This approach requires
> > an additional request-response call which a Near tx issues to a Dht
> > node to check whether the tx sending the request is waiting for
> > another tx. The call returns the identifier of the Near tx blocking
> > the sender (if the sender is in fact blocked). Then the waiting Near
> > tx relays a probe to the blocked Near tx. Let's call that approach
> > "lock-checking".
> >
> > I would like to define several points to consider while comparing
> > approaches (please point out more if I miss something):
> > 1. Correctness. Especially regarding reporting false deadlocks.
> > 2. Extensibility. Rollback to savepoint and generalization for classic
> > transactions should be considered.
> > 3. Messaging overhead.
> > 4. Simplicity of implementation.
> >
> > You can find an implementation of the "lock-checking" approach in the
> > PR attached to the ticket [1]. Currently it works correctly only for
> > transactions obeying strict two-phase locking (2PL). It is fairly easy
> > to adapt the PR to implement the "forward-only" approach, which will
> > also work correctly under 2PL. The "lock-checking" algorithm uses 1.5
> > times more messages than "forward-only". The implementation of the
> > "forward-only" algorithm is simpler in the sense that it does not
> > require introducing lock-wait-check messages and a future. But to me
> > that does not look like a big deal. I think that the implementation
> > efforts for both approaches are almost equal so far.
> >
> > But the cornerstone here is extensibility. Let's imagine that we are
> > going to implement rollback to savepoint. One suggested way to extend
> > the "forward-only" approach for that case is invalidating a sent
> > probe. Basically, a tx that initiated a deadlock check assigns a
> > unique id to the probe. And this probe id is invalidated when the tx
> > wakes up from its wait. If that tx receives back a probe with an
> > invalidated id, it simply discards it. If the id is not invalidated,
> > it means a deadlock. But a false deadlock can still be detected. I
> > will provide an example where it does not work correctly.
> >
> > Please note that our transactions can request multiple locks
> > simultaneously (the so-called AND model). Also, rollback to savepoint
> > releases the locks obtained after creating the savepoint. Let's use
> > the following notation: "*" means savepoint, "^" means rollback to
> > the previous savepoint, "!" means acquiring a lock, "?" means waiting
> > for a lock, "&" means requesting multiple locks simultaneously. See
> > the following execution scenario for 3 transactions and 3 keys:
> > TA      | TB | TC
> >         |    | *
> >         | 1! | 2!
> > 1? & 3! |    |
> >         | 2? |
> >         |    | ^
> >         |    | 3!
> >
> > We can notice that the suggested scenario does not introduce a
> > deadlock, because locks are requested in a consistent order among all
> > transactions. But at different moments the waiting relations TA w TB,
> > TB w TC, TC w TA all exist. So, we suspect a false deadlock could be
> > detected here. Also, one can notice that the relation TC w TA is
> > established after TB w TC is destroyed. The messages on a timeline
> > diagram demonstrate how the "forward-only with probe invalidation"
> > approach can detect a false deadlock here [2]. In the picture L means
> > "lock", U "unlock", W "wait". A red cross means reporting a false
> > deadlock. The intuition here is that while a probe is in flight one
> > wait-for edge can be destroyed (TB w TC) and another one can be
> > created (TC w TA). So, the approach can report false deadlocks. Also,
> > note that a false deadlock can be found even when the locking order is
> > consistent!
> >
> > After a while I will describe how "lock-checking" can be adapted to
> > support rollback to savepoint. The disclaimer here is that it will
> > require even more messages.
> > ----
> >
> > I think that we can already start a discussion.
> > Actually, while false deadlocks can be surprising, we can perhaps
> > tolerate them because our transaction implementation assumes that
> > transactions can be rolled back by the system. The "forward-only"
> > approach requires fewer messages, but in big-O terms both approaches
> > have the same complexity. Unfortunately, both the correctness and the
> > complexity have a hardly predictable impact on real workloads. In an
> > ideal world thorough testing could give the answers.
> >
> > Please share your vision.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > [2] https://gist.githubusercontent.com/pavlukhin/c8c7c6266eeab56048c31f5cdfb31d20/raw/7d1aef9abcd014ec9fdf69840168768ced638b6c/msg_diagram.jpg
> > Mon, 26 Nov 2018 at 15:27, Павлухин Иван <[hidden email]>:
> > >
> > > Vladimir,
> > >
> > > I think it might work. So, if nobody minds, I can start prototyping
> > > the edge-chasing approach.
> > > Mon, 26 Nov 2018 at 14:32, Vladimir Ozerov <[hidden email]>:
> > > >
> > > > Ivan,
> > > >
> > > > The problem is that in our system a transaction may wait for N
> > > > locks simultaneously. This may form complex graphs which spread
> > > > across many nodes. Now consider that I have a deadlock between 4
> > > > nodes: A -> B -> *C* -> D -> A. I've sent a message from A and it
> > > > never reached D because C failed. Well, maybe that is good, because
> > > > some transactions will be rolled back. But when they are rolled
> > > > back, other cycles from pending multi-way deadlocks may form again,
> > > > e.g. A -> B -> *E* -> D -> A. So the question is whether we will be
> > > > able to detect them reliably. I think that we may use the following
> > > > assumptions:
> > > > 1) If a data node fails, the relevant transactions will be rolled
> > > > back.
> > > > 2) It means that some other transactions will make at least partial
> > > > progress with lock acquisition.
> > > >
> > > > So, we can introduce a kind of counter which is advanced on every
> > > > acquired lock on a given node. And define a rule that a deadlock
> > > > check request for the given pair (NODE_ID, COUNTER) should be
> > > > issued at most once. This way, even if a specific deadlock check
> > > > request is lost due to node failure, and another deadlock has
> > > > formed, then some other node will re-trigger a deadlock check
> > > > request sooner or later, as its counter advances.
> > > >
> > > > Makes sense?
> > > >
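
A tiny sketch of that at-most-once rule, with illustrative names and
map-based bookkeeping assumed:

    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class DeadlockCheckThrottle {
        /** Highest lock counter for which a check was already requested, per node. */
        private final ConcurrentMap<UUID, Long> requested = new ConcurrentHashMap<>();

        /**
         * Returns true at most once per (nodeId, counter) pair, so a check
         * is re-triggered only after the node's lock counter advances.
         */
        boolean shouldRequestCheck(UUID nodeId, long counter) {
            Long prev = requested.get(nodeId);
            while (prev == null || prev < counter) {
                if (prev == null
                    ? requested.putIfAbsent(nodeId, counter) == null
                    : requested.replace(nodeId, prev, counter))
                    return true;
                prev = requested.get(nodeId);
            }
            return false;
        }
    }
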
> > > > On Sat, Nov 24, 2018 at 5:40 PM Павлухин Иван <[hidden email]> wrote:
> > > >
> > > > > Hi Vladimir,
> > > > >
> > > > > Regarding fault tolerance. It seems that it is not a problem for
> > > > > edge-chasing approaches. A found deadlock is identified by a
> > > > > message returned to the detection initiator carrying the
> > > > > initiator's identifier. If there is no deadlock then such a
> > > > > message will not come. If some node containing a deadlocked
> > > > > transaction fails, then the failure itself will break the
> > > > > deadlock. Am I missing something?
> > > > >
> > > > > About messaging overhead. Indeed, it looks like edge-chasing
> > > > > approaches can bring redundant messages. Perhaps we can borrow
> > > > > some optimization ideas from [1]. And I am also thinking about a
> > > > > timeout before starting a deadlock detection.
> > > > >
> > > > > Thoughts about adaptive timeouts lead me to some thoughts about
> > > > > monitoring. I assume that long waits for locks signal suboptimal
> > > > > performance of the system. I think it would be great to have a
> > > > > means of monitoring transaction contention. It could help users
> > > > > improve their workloads. It could also help us build some
> > > > > reasoning about how contention correlates with deadlock detection
> > > > > overhead.
> > > > >
> > > > > [1] http://mentallandscape.com/Papers_podc84.pdf
> > > > > Sun, 18 Nov 2018 at 10:52, Vladimir Ozerov <[hidden email]>:
> > > > > >
> > > > > > Hi Ivan,
> > > > > >
> > > > > > Great analysis. Agree that edge-chasing looks like the better
> > > > > > candidate. First, it will be applicable to both normal and MVCC
> > > > > > transactions. Second, in MVCC we will probably also need to
> > > > > > release some locks when doing rollbacks. What we should think
> > > > > > about is failover - what if a node which was in the middle of a
> > > > > > graph fails? We need to craft fault tolerance carefully.
> > > > > >
> > > > > > Another critically important point is how to trigger deadlock
> > > > > > detection. My concern about edge-chasing was not about latency,
> > > > > > but about the number of messages which travel between nodes. A
> > > > > > good algorithm must produce little to no messages in case of
> > > > > > normal contention while still providing protection in case of
> > > > > > real deadlocks. So how would we trigger graph traversal for a
> > > > > > given transaction? Maybe we can start with a hard timeout and
> > > > > > then employ a kind of adaptive increment in case a high rate of
> > > > > > false positives is observed. Maybe something else.
> > > > > >
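
A tiny sketch of such an adaptive trigger delay, with illustrative
names and thresholds:

    import java.util.concurrent.atomic.AtomicLong;

    /** Adaptive delay before a blocked tx triggers graph traversal. */
    class AdaptiveDetectionDelay {
        private final AtomicLong delayMs = new AtomicLong(1_000);

        /** Called when a completed check found no real deadlock. */
        void onFalsePositive() {
            // Back off so normal contention produces fewer probes.
            delayMs.updateAndGet(d -> Math.min(d * 2, 60_000));
        }

        /** Called when a real deadlock was detected and broken. */
        void onDeadlockFound() {
            delayMs.updateAndGet(d -> Math.max(d / 2, 100));
        }

        long currentDelayMs() {
            return delayMs.get();
        }
    }
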
> > > > > > On Sun, Nov 18, 2018 at 10:21 AM Павлухин Иван <[hidden email]> wrote:
> > > > > >
> > > > > > > Vladimir,
> > > > > > >
> > > > > > > Thanks for the articles! I studied them and a couple of
> > > > > > > others. And I would like to share the knowledge I found.
> > > > > > >
> > > > > > > BACKGROUND
> > > > > > > First of all, our algorithm implemented in
> > > > > > > org.apache.ignite.internal.processors.cache.transactions.TxDeadlockDetection
> > > > > > > is not an edge-chasing algorithm. In essence, a lock-waiting
> > > > > > > transaction site polls the nodes responsible for the keys of
> > > > > > > interest one by one and reconstructs the global wait-for graph
> > > > > > > (WFG) locally.
> > > > > > > The centralized algorithm discussed in this thread looks
> > > > > > > similar to the algorithms proposed by Ho [1]. The simplest of
> > > > > > > them (two-phased) reports false deadlocks when unlock before
> > > > > > > transaction finish is permitted. So, it seems that it only
> > > > > > > works when strict two-phase locking (2PL) is obeyed. Another
> > > > > > > one (called one-phased) requires tracking all locked keys of
> > > > > > > each transaction, which is not desirable for MVCC transactions.
> > > > > > > The aforementioned edge-chasing algorithm by Chandy, Misra and
> > > > > > > Haas [2] is proven to work even when 2PL is not obeyed.
> > > > > > > Also, the performance target is not clear to me. It looks like
> > > > > > > centralized approaches can use fewer messages and provide lower
> > > > > > > latency than distributed ones. But I would like to understand
> > > > > > > what the orders of latency and messaging overhead are. Many of
> > > > > > > the algorithms are described in [3] and some performance
> > > > > > > details are mentioned. It is said "As per [GRAY81], most
> > > > > > > deadlocks involve two to four transactions." I see one more
> > > > > > > argument making deadlock detection latency not so important. We
> > > > > > > can consider deadlock an unavoidable situation, but even a
> > > > > > > lightning-fast detector will not save a database from
> > > > > > > performance degradation in case of tons of deadlocks.
> > > > > > > What is also interesting, Google Cloud Spanner employs simple
> > > > > > > "wound-wait" deadlock prevention [4] instead of any detection
> > > > > > > algorithm.
> > > > > > >
> > > > > > > DISCUSSION
> > > > > > > So, I see the following options:
> > > > > > > 1. An edge-chasing algorithm like Chandy, Misra and Haas [2].
> > > > > > > 2. The centralized two-phased algorithm described by Ho [1]
> > > > > > > with the restriction that transactions must obey 2PL.
> > > > > > > Personally, I lean towards the edge-chasing approach because it
> > > > > > > looks simple and universal. Restricting ourselves to 2PL will
> > > > > > > limit the features which could be implemented in the future
> > > > > > > (e.g. transaction savepoints), will it not?
> > > > > > >
> > > > > > > WDYT?
> > > > > > >
> > > > > > > Ivan
> > > > > > >
> > > > > > > [1] https://turing.cs.hbg.psu.edu/comp512.papers/HoRamamoorthy-82.pdf
> > > > > > > [2] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > > [3] https://www.ics.uci.edu/~cs237/reading/p37-elmagarmid.pdf
> > > > > > > [4] https://cloud.google.com/spanner/docs/transactions#rw_transaction_performance
> > > > > > >
> > --
> > Best regards,
> > Ivan Pavlukhin
> >



--
Best regards,
Ivan Pavlukhin

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Hi,

I prepared a patch with an implementation of the deadlock detection
algorithm (you can find it in the ticket [1]).

Now I would like to discuss options for configuring deadlock
detection. From a usability standpoint the following needs to be
supported:
1. Turn deadlock detection on/off (on by default).
2. Configure a delay between starting to wait for a lock and sending
the first deadlock detection message (e.g. 1 second by default).
3. Something else?

I can suggest the following options (see the sketch below):
1. Use a single long for configuration. A value < 0 means deadlock
detection is disabled. 0 means starting detection without a delay. A
value > 0 specifies a delay in milliseconds before starting detection.
2. Use a boolean flag for turning detection on/off plus a long value
for configuring the delay.

Also, there are a number of ways to pass these values to Ignite. I do
not have much experience in adding new configuration options to Ignite
and I need advice here. For example, it could be done by adding extra
properties to the TransactionConfiguration class. Another way is using
Java system properties.
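
A minimal sketch of option 2 expressed as TransactionConfiguration-style
properties; the names and defaults below are hypothetical, not existing
Ignite API:

    /** Hypothetical additions to TransactionConfiguration (option 2). */
    public class TransactionConfigurationSketch {
        /** Whether deadlock detection is enabled. */
        private boolean deadlockDetectionEnabled = true;

        /** Delay in ms between blocking on a lock and sending the first probe. */
        private long deadlockDetectionDelay = 1_000L;

        public boolean isDeadlockDetectionEnabled() {
            return deadlockDetectionEnabled;
        }

        public void setDeadlockDetectionEnabled(boolean enabled) {
            this.deadlockDetectionEnabled = enabled;
        }

        public long getDeadlockDetectionDelay() {
            return deadlockDetectionDelay;
        }

        public void setDeadlockDetectionDelay(long delayMs) {
            this.deadlockDetectionDelay = delayMs;
        }
    }
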

Igor, Vladimir and others what do you think?

[1] https://issues.apache.org/jira/browse/IGNITE-9322



--
Best regards,
Ivan Pavlukhin

Re: Suggestion to improve deadlock detection

gvvinblade
Ivan,

We have to design the configuration carefully.
There are possible compatibility issues, and since this API is public
it might be difficult to redesign it after a while.

Since there is a ticket about the MVCC-related configuration parts [1]
(TxLog memory region/size, vacuum thread count and intervals, etc.),
I suggest solving the issue in the scope of that ticket.

For now, let's introduce a couple of system properties to be able to
change the behavior during the development phase.
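
A minimal sketch of such development-phase switches read from plain
Java system properties; the property names are made up for
illustration:

    /** Development-phase switches; the property names are hypothetical. */
    class DeadlockDetectionSettings {
        /** -DIGNITE_TX_DEADLOCK_DETECTION_ENABLED=false disables detection. */
        static final boolean ENABLED = Boolean.parseBoolean(
            System.getProperty("IGNITE_TX_DEADLOCK_DETECTION_ENABLED", "true"));

        /** -DIGNITE_TX_DEADLOCK_DETECTION_DELAY=500 sets the delay in ms. */
        static final long DELAY_MS =
            Long.getLong("IGNITE_TX_DEADLOCK_DETECTION_DELAY", 1_000L);
    }
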

[1] https://issues.apache.org/jira/browse/IGNITE-7966

Regards,
Igor

чт, 20 дек. 2018 г. в 16:29, Павлухин Иван <[hidden email]>:

> Hi,
>
> I prepared a patch with deadlock detection algorithm implementation
> (you can find it from ticket [1]).
>
> Now I would like to discuss options for configuring deadlock
> detection. From a usability standpoint following need to be supported:
> 1. Turn deadlock detection on/off (on by default).
> 2. Configure a delay between starting waiting for a lock and sending
> first deadlock detection message (e.g. 1 second by default).
> 3. Something else?
>
> I can suggest following options:
> 1. Use a single long for configuration. Values < 0 means disabled
> deadlock detection. 0 means starting detection without a delay. Values
> > 0 specifies a delay in milliseconds before starting detection.
> 2. Use a boolean flag for turning detection on/off. Use a long value
> for configuring delay.
>
> Also there are number of ways to pass this values to Ignite. I do no
> have much experience in adding new configuration options to Ignite and
> I need an advice here. For example, it could be done by adding
> additional properties to TransactionConfiguration class. Another way
> is using java system properties.
>
> Igor, Vladimir and others what do you think?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9322
> ср, 19 дек. 2018 г. в 15:51, Павлухин Иван <[hidden email]>:
> >
> > Igor,
> >
> > I see your points. And I agree to start with "lightweight
> > implementation". Today we have 2PL and there is no activity on
> > implementing rollback to savepoint. And if we do it in the future we
> > will have to return to the subject of deadlock detection anyway.
> >
> > I will proceed with "forward-only" approach.
> > вт, 18 дек. 2018 г. в 18:33, Seliverstov Igor <[hidden email]>:
> > >
> > > Ivan,
> > >
> > > I would prefer forward-only implementation even knowing it allows false
> > > positive check results.
> > >
> > > Why I think so:
> > >
> > > 1) From my experience, when we have any future is waiting for reply, we
> > > have to take a failover into consideration.
> > > Usually failover implementations are way more complex than an initial
> > > algorithm itself.
> > > Forward-only version doesn't need any failover implementation as it
> expects
> > > failed probes (the probes didn't face a deadlock ).
> > >
> > > 2) Simple lightweight feature implementation is a really good point to
> > > start - it may be extended if needed but really often such extending
> > > doesn't need at all.
> > >
> > > 3) Any implementation allows false positive results - we are working
> > > with a distributed system where a bunch of processes happen at the same
> > > time; for example, a concurrent rollback on timeout may resolve a
> > > deadlock while a probe is still in flight, so the probe also produces a
> > > rollback on deadlock as a result.
> > >
> > > 4) The described case (when a false positive result is probable) is
> > > not just unusual but potentially extremely rare; I don't think we should
> > > solve it since it doesn't break the grid.
> > >
> > > Regards,
> > > Igor
> > >
> > > Tue, 18 Dec 2018 at 17:57, Ivan Pavlukhin <[hidden email]>:
> > >
> > > > Hi folks,
> > > >
> > > > During implementation of the edge-chasing deadlock detection
> > > > algorithm in the scope of [1] it has been realized that basically we
> > > > have 2 options for the "chasing" strategy. I will use the term Near
> > > > when GridNearTxLocal is assumed and Dht when GridDhtTxLocal (a tx
> > > > which updates a primary partition) is assumed. So, here are the 2
> > > > strategies:
> > > > 1. Send an initial probe when a Dht tx blocks waiting for another
> > > > tx to release a key. The initial probe is sent from the waiting Dht
> > > > tx to the Near tx holding the lock. If the receiving Near tx is
> > > > blocked as well then it relays the probe to the Dht tx it awaits a
> > > > response from. So, the probe travels like Dht -> Near -> Dht -> Near
> > > > -> ... Let's call this approach "forward-only" (see the sketch after
> > > > this list).
> > > > 2. Send probes only between Near transactions. This approach requires
> > > > an additional request-response call which a Near tx issues to a Dht
> > > > node to check if the tx sending the request is waiting for another
> > > > tx. The call returns the identifier of the Near tx blocking the tx
> > > > sending the request (if the sending tx is in fact blocked). Then the
> > > > waiting Near tx relays a probe to the blocked Near tx. Let's call
> > > > that approach "lock-checking".
> > > >
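> > > > A rough sketch of the forward-only relay step (the Tx interface and
> > > > the rollback/sendTo calls are illustrative placeholders, not the
> > > > actual Ignite classes):
> > > >
> > > > import java.util.UUID;
> > > >
> > > > /** Probe carrying the id of the tx that initiated the check. */
> > > > record DeadlockProbe(UUID initiatorTxId) {}
> > > >
> > > > interface Tx {
> > > >     UUID id();
> > > >     boolean isBlocked();
> > > >     UUID waitsForTxId(); // tx currently holding the awaited lock
> > > > }
> > > >
> > > > void onProbe(DeadlockProbe probe, Tx rcv) {
> > > >     if (probe.initiatorTxId().equals(rcv.id()))
> > > >         rollback(rcv);                     // probe came back: deadlock
> > > >     else if (rcv.isBlocked())
> > > >         sendTo(rcv.waitsForTxId(), probe); // relay along wait-for edge
> > > >     // otherwise drop the probe: no deadlock through this tx
> > > > }
> > > >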
> > > > I would like to define several points to consider while comparing
> > > > approaches (please point out more if I miss something):
> > > > 1. Correctness. Especially regarding reporting false deadlocks.
> > > > 2. Extensibility. Rollback to savepoint and generalization for
> > > > classic transactions should be considered.
> > > > 3. Messaging overhead.
> > > > 4. Simplicity of implementation.
> > > >
> > > > You can find an implementation of the "lock-checking" approach in
> > > > the PR attached to the ticket [1]. Currently it works correctly only
> > > > for transactions obeying strict two-phase locking (2PL). It is fairly
> > > > easy to adapt the PR to implement the "forward-only" approach, which
> > > > will also work correctly under 2PL. The "lock-checking" algorithm
> > > > uses 1.5 times more messages than "forward-only". The implementation
> > > > of the "forward-only" algorithm is simpler in the sense that it does
> > > > not require introducing lock-wait-check messages and a future. But as
> > > > for me that does not look like a big deal. I think that the
> > > > implementation efforts for both approaches are almost equal so far.
> > > >
> > > > But the cornerstone here is extensibility. Let's imagine that we are
> > > > going to implement rollback to savepoint. One suggested way to extend
> > > > the "forward-only" approach for that case is invalidating a sent
> > > > probe. Basically, a tx initiating a deadlock check assigns a unique
> > > > id to the probe, and this probe id is invalidated when the tx wakes
> > > > up from the wait. If that tx receives back a probe with an
> > > > invalidated id it simply discards it. If the id is not invalidated it
> > > > means a deadlock. But a false deadlock can still be detected. I will
> > > > provide an example when it does not work correctly.
> > > >
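> > > > As a sketch, such probe invalidation could be a per-tx version
> > > > counter (names are illustrative only):
> > > >
> > > > import java.util.concurrent.atomic.AtomicLong;
> > > >
> > > > class ProbeVersion {
> > > >     private final AtomicLong ver = new AtomicLong();
> > > >
> > > >     long currentId() { return ver.get(); }    // id for a new probe
> > > >
> > > >     void onWakeUp() { ver.incrementAndGet(); } // invalidate in-flight probes
> > > >
> > > >     boolean isValid(long id) { return ver.get() == id; }
> > > > }
> > > >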
> > > > Please note that our transactions can request multiple locks
> > > > simultaneously (the so-called AND model). Also, rollback to savepoint
> > > > releases locks obtained after creating a savepoint. Let's use the
> > > > following notation. "*" means savepoint. "^" means rollback to
> > > > previous savepoint. "!" means acquiring a lock. "?" means waiting for
> > > > a lock. "&" means requesting multiple locks simultaneously. See the
> > > > following transaction execution scenario for 3 transactions and 3
> > > > keys:
> > > > TA      | TB | TC
> > > >         |    | *
> > > >         | 1! | 2!
> > > > 1? & 3! |    |
> > > >         | 2? |
> > > >         |    | ^
> > > >         |    | 3!
> > > >
> > > > We can notice that the suggested scenario does not introduce a
> > > > deadlock because locks are requested in a consistent order among all
> > > > transactions. But at different moments the waiting relations TA w TB,
> > > > TB w TC and TC w TA exist. So, we suspect a false deadlock could be
> > > > detected here. Also, one can notice that the relation TC w TA is
> > > > established after TB w TC is destroyed. Messages on the timeline
> > > > diagram demonstrate how the "forward-only with probe invalidation"
> > > > approach can detect a false deadlock here [2]. On the picture L means
> > > > "lock", U "unlock", W "wait". A red cross means reporting a false
> > > > deadlock. The intuition here is that while a probe is in flight one
> > > > wait-for edge can be destroyed (TB w TC) and another one can be
> > > > created (TC w TA). So, the approach can report false deadlocks. Also,
> > > > note that a false deadlock can be found even when the locking order
> > > > is consistent!
> > > >
> > > > After a while I will describe how "lock-checking" can be adapted to
> > > > support rollback to savepoint. The disclaimer here is that it will
> > > > require even more messages.
> > > > ----
> > > >
> > > > I think that we can already start a discussion.
> > > > Actually, while false deadlocks can be surprising, we can perhaps
> > > > tolerate them because our transaction implementation assumes that
> > > > transactions can be rolled back by the system. The "forward-only"
> > > > approach requires fewer messages, but in big-O terms both approaches
> > > > have the same complexity. Unfortunately, both the correctness and the
> > > > messaging overhead have a hardly predictable impact on real
> > > > workloads. In an ideal world thorough testing could give the answers.
> > > >
> > > > Please, share your vision.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > > > [2] https://gist.githubusercontent.com/pavlukhin/c8c7c6266eeab56048c31f5cdfb31d20/raw/7d1aef9abcd014ec9fdf69840168768ced638b6c/msg_diagram.jpg
> > > > Mon, 26 Nov 2018 at 15:27, Ivan Pavlukhin <[hidden email]>:
> > > > >
> > > > > Vladimir,
> > > > >
> > > > > I think it might work. So, if nobody minds, I can start
> > > > > prototyping the edge-chasing approach.
> > > > > Mon, 26 Nov 2018 at 14:32, Vladimir Ozerov <[hidden email]>:
> > > > > >
> > > > > > Ivan,
> > > > > >
> > > > > > The problem is that in our system a transaction may wait for N
> > > > > > locks simultaneously. This may form complex graphs which spread
> > > > > > between many nodes. Now consider that I have a deadlock between 4
> > > > > > nodes: A -> B -> *C* -> D -> A. I've sent a message from A and it
> > > > > > never reached D because C failed.
> > > > > > Well, maybe that is good, because some transactions will be
> > > > > > rolled back. But when they are rolled back, other cycles from
> > > > > > pending multi-way deadlocks may form again, e.g.
> > > > > > A -> B -> *E* -> D -> A. So the question is whether we will be
> > > > > > able to detect them reliably. I think that we may use the
> > > > > > following assumptions:
> > > > > > 1) If a data node fails, the relevant transactions will be
> > > > > > rolled back
> > > > > > 2) It means that some other transactions will make at least
> > > > > > partial progress with lock acquisition
> > > > > >
> > > > > > So, we can introduce a kind of counter which is advanced on
> > > > > > every acquired lock on a given node, and define a rule that a
> > > > > > transaction deadlock check request for a given pair
> > > > > > (NODE_ID, COUNTER) should be issued at most once. This way, even
> > > > > > if a specific deadlock check request is lost due to a node
> > > > > > failure and another deadlock has formed, some other node will
> > > > > > re-trigger a deadlock check request sooner or later, as its
> > > > > > counter has advanced (a sketch of this rule follows below).
> > > > > >
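> > > > > > A tiny sketch of the at-most-once rule (illustrative names):
> > > > > >
> > > > > > import java.util.Map;
> > > > > > import java.util.Set;
> > > > > > import java.util.UUID;
> > > > > > import java.util.concurrent.ConcurrentHashMap;
> > > > > >
> > > > > > // Pairs for which a deadlock check was already requested.
> > > > > > Set<Map.Entry<UUID, Long>> requested = ConcurrentHashMap.newKeySet();
> > > > > >
> > > > > > boolean shouldRequestCheck(UUID nodeId, long lockCounter) {
> > > > > >     // add() returns false if the pair was seen before
> > > > > >     return requested.add(Map.entry(nodeId, lockCounter));
> > > > > > }
> > > > > >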
> > > > > > Makes sense?
> > > > > >
> > > > > > On Sat, Nov 24, 2018 at 5:40 PM Ivan Pavlukhin <[hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Vladimir,
> > > > > > >
> > > > > > > Regarding fault tolerance. It seems that it is not a problem
> > > > > > > for edge-chasing approaches. A found deadlock is identified by
> > > > > > > a message returned to the detection initiator with the
> > > > > > > initiator's identifier. If there is no deadlock then such a
> > > > > > > message will not come. If some node containing a deadlocked
> > > > > > > transaction fails then it will break the deadlock. Am I missing
> > > > > > > something?
> > > > > > >
> > > > > > > About messaging overhead. Indeed, it looks like edge-chasing
> > > > > > > approaches can bring redundant messages. Perhaps we can borrow
> > > > > > > some optimization ideas from [1]. And I am also thinking about
> > > > > > > a timeout before starting a deadlock detection.
> > > > > > >
> > > > > > > Thoughts about adaptive timeouts lead me to some thoughts
> > > > > > > about monitoring. I assume that long waiting for locks signals
> > > > > > > suboptimal performance of the system. I think it would be great
> > > > > > > to have means of monitoring transaction contention. It could
> > > > > > > help users to improve their workloads. It could also help us to
> > > > > > > build some reasoning about how contention correlates with
> > > > > > > deadlock detection overhead.
> > > > > > >
> > > > > > > [1] http://mentallandscape.com/Papers_podc84.pdf
> > > > > > > Sun, 18 Nov 2018 at 10:52, Vladimir Ozerov <[hidden email]>:
> > > > > > > >
> > > > > > > > Hi Ivan,
> > > > > > > >
> > > > > > > > Great analysis. Agree that edge-chasing looks like the
> > > > > > > > better candidate. First, it will be applicable to both normal
> > > > > > > > and MVCC transactions. Second, in MVCC we will probably also
> > > > > > > > need to release some locks when doing rollbacks. What we
> > > > > > > > should think about is failover - what if a node which was in
> > > > > > > > the middle of a graph fails? We need to craft fault-tolerance
> > > > > > > > carefully.
> > > > > > > >
> > > > > > > > Another critically important point is how to trigger
> > > > > > > > deadlock detection. My concern about edge-chasing was not
> > > > > > > > about latency, but about the number of messages which travel
> > > > > > > > between nodes. A good algorithm must produce little to no
> > > > > > > > messages in case of normal contention while still providing
> > > > > > > > protection in case of real deadlocks. So how would we trigger
> > > > > > > > graph traversal for the given transaction? Maybe we can start
> > > > > > > > with a hard timeout and then employ a kind of adaptive
> > > > > > > > increment in case a high rate of false positives is observed.
> > > > > > > > Maybe something else (one possible shape of such a trigger is
> > > > > > > > sketched below).
> > > > > > > >
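> > > > > > > > A sketch of an adaptive trigger delay (the numbers are
> > > > > > > > arbitrary, purely an illustration):
> > > > > > > >
> > > > > > > > long delayMs = 1_000; // initial hard timeout before a probe
> > > > > > > >
> > > > > > > > void onDetectionOutcome(boolean deadlockFound) {
> > > > > > > >     if (deadlockFound)
> > > > > > > >         delayMs = Math.max(delayMs / 2, 1_000);  // react faster
> > > > > > > >     else
> > > > > > > >         delayMs = Math.min(delayMs * 2, 60_000); // back off
> > > > > > > > }
> > > > > > > >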
> > > > > > > > On Sun, Nov 18, 2018 at 10:21 AM Ivan Pavlukhin
> > > > > > > > <[hidden email]> wrote:
> > > > > > > >
> > > > > > > > > Vladimir,
> > > > > > > > >
> > > > > > > > > Thanks for the articles! I studied them and a couple of
> > > > > > > > > others, and I would like to share the knowledge I found.
> > > > > > > > >
> > > > > > > > > BACKGROUND
> > > > > > > > > First of all, our algorithm implemented in
> > > > > > > > > org.apache.ignite.internal.processors.cache.transactions.TxDeadlockDetection
> > > > > > > > > is not an edge-chasing algorithm. In essence, a lock-waiting
> > > > > > > > > transaction site polls the nodes responsible for the keys
> > > > > > > > > of interest one by one and reconstructs the global wait-for
> > > > > > > > > graph (WFG) locally.
> > > > > > > > > The centralized algorithm discussed in this thread looks
> > > > > > > > > similar to the algorithms proposed by Ho [1]. The simplest
> > > > > > > > > of them (two-phased) reports false deadlocks when unlocking
> > > > > > > > > before the transaction finish is permitted. So, it seems
> > > > > > > > > that it only works when strict two-phase locking (2PL) is
> > > > > > > > > obeyed. Another one (called one-phased) requires tracking
> > > > > > > > > all locked keys by each transaction, which is not desirable
> > > > > > > > > for MVCC transactions.
> > > > > > > > > The aforementioned edge-chasing algorithm by Chandy, Misra
> > > > > > > > > and Haas [2] is proven to work even when 2PL is not obeyed.
> > > > > > > > > Also, the performance target is not clear to me. It looks
> > > > > > > > > like centralized approaches can use fewer messages and
> > > > > > > > > provide lower latency than distributed ones, but I would
> > > > > > > > > like to understand the orders of latency and messaging
> > > > > > > > > overhead. Many algorithms are described in [3] and some
> > > > > > > > > performance details are mentioned. It is said: "As per
> > > > > > > > > [GRAY81], most deadlocks involve two to four transactions."
> > > > > > > > > I see one more argument making deadlock detection latency
> > > > > > > > > not so important: we can consider a deadlock an unavoidable
> > > > > > > > > situation, but even a lightning fast detector will not save
> > > > > > > > > a database from performance degradation in case of tons of
> > > > > > > > > deadlocks.
> > > > > > > > > What is also interesting, Google Cloud Spanner employs
> > > > > > > > > simple "wound-wait" deadlock prevention [4] instead of any
> > > > > > > > > detection algorithm (see the sketch below).
> > > > > > > > >
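> > > > > > > > > For reference, wound-wait resolves lock conflicts by
> > > > > > > > > transaction age, roughly as sketched here (Tx, abort and
> > > > > > > > > block are illustrative placeholders, not Spanner's code):
> > > > > > > > >
> > > > > > > > > void onLockConflict(Tx requester, Tx holder) {
> > > > > > > > >     if (requester.startTs() < holder.startTs())
> > > > > > > > >         abort(holder);    // older tx "wounds" younger holder
> > > > > > > > >     else
> > > > > > > > >         block(requester); // younger tx waits for the older
> > > > > > > > > }
> > > > > > > > >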
> > > > > > > > > DISCUSSION
> > > > > > > > > So, I see the following options:
> > > > > > > > > 1. Edge-chasing algorithm like Chandy, Misra and Haas [2].
> > > > > > > > > 2. A centralized two-phased algorithm described by Ho [1]
> > > > > > > > > with the restriction that transactions must obey 2PL.
> > > > > > > > > Personally, I lean towards the edge-chasing approach
> > > > > > > > > because it looks simple and universal. Restricting
> > > > > > > > > ourselves to 2PL will restrict the features which could be
> > > > > > > > > implemented in the future (e.g. transaction savepoints),
> > > > > > > > > will it not?
> > > > > > > > >
> > > > > > > > > WDYT?
> > > > > > > > >
> > > > > > > > > Ivan
> > > > > > > > >
> > > > > > > > > [1] https://turing.cs.hbg.psu.edu/comp512.papers/HoRamamoorthy-82.pdf
> > > > > > > > > [2] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > > > > [3] https://www.ics.uci.edu/~cs237/reading/p37-elmagarmid.pdf
> > > > > > > > > [4] https://cloud.google.com/spanner/docs/transactions#rw_transaction_performance
> > > > > > > > >
> > > > > > > > > Wed, 14 Nov 2018 at 22:13, Vladimir Ozerov <[hidden email]>:
> > > > > > > > > >
> > > > > > > > > > Ivan,
> > > > > > > > > >
> > > > > > > > > > This is an interesting question. I think we should spend
> > > > > > > > > > some time on formal verification of whether this
> > > > > > > > > > algorithm works or not. Several articles you may use as a
> > > > > > > > > > starting point: [1], [2]. From what I understand, Ignite
> > > > > > > > > > falls into the "AND" model, and the currently implemented
> > > > > > > > > > algorithm is a variation of the "edge-chasing" approach
> > > > > > > > > > as per Chandy, Misra and Haas [3], which is *proven to be
> > > > > > > > > > correct* in that it both detects deadlocks when they are
> > > > > > > > > > present and does not produce false positives. But it
> > > > > > > > > > might be too heavy for the system under contention.
> > > > > > > > > >
> > > > > > > > > > We need to search for a formal proof of correctness of
> > > > > > > > > > the proposed algorithm. This area is already researched
> > > > > > > > > > thoroughly enough, so we should be able to find an answer
> > > > > > > > > > quickly.
> > > > > > > > > >
> > > > > > > > > > Vladimir.
> > > > > > > > > >
> > > > > > > > > > [1] http://www.cse.scu.edu/~jholliday/dd_9_16.htm
> > > > > > > > > > [2] https://www.cs.uic.edu/~ajayk/Chapter10.pdf
> > > > > > > > > > [3] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 14, 2018 at 6:55 PM Ivan Pavlukhin
> > > > > > > > > > <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > Next part as promised. A working item for me is a
> > > > > > > > > > > deadlock detector for MVCC transactions [1]. The
> > > > > > > > > > > message is structured in 2 parts. The first is an
> > > > > > > > > > > analysis of the current state of affairs and possible
> > > > > > > > > > > options to go with. The second is a proposed option.
> > > > > > > > > > > The first part is going to be not so short, so some
> > > > > > > > > > > might prefer to skip it.
> > > > > > > > > > >
> > > > > > > > > > > ANALYSIS
> > > > > > > > > > > The immediate question is "why can't we use the
> > > > > > > > > > > existing deadlock detector?". The difference between
> > > > > > > > > > > the classic and MVCC transaction implementations is the
> > > > > > > > > > > answer. Currently a collection of IgniteTxEntry is used
> > > > > > > > > > > for detection, but such a collection is not maintained
> > > > > > > > > > > for MVCC transactions. So, it will not work out of the
> > > > > > > > > > > box.
> > > > > > > > > > > Also, it looks like the current distributed iterative
> > > > > > > > > > > approach cannot be low latency in the worst case
> > > > > > > > > > > because it possibly does many network requests
> > > > > > > > > > > sequentially.
> > > > > > > > > > > So, what options do we have? Generally we should
> > > > > > > > > > > choose between centralized and distributed approaches.
> > > > > > > > > > > By a centralized approach I mean the existence of a
> > > > > > > > > > > dedicated deadlock detector located on a single node.
> > > > > > > > > > > In the centralized approach we can face difficulties
> > > > > > > > > > > related to failover, as the node running the deadlock
> > > > > > > > > > > detector can fail. In the distributed approach extra
> > > > > > > > > > > network messaging overhead can strike because different
> > > > > > > > > > > nodes participating in a deadlock can start detection
> > > > > > > > > > > independently and send redundant messages. I see some
> > > > > > > > > > > aspects which matter for choosing an implementation.
> > > > > > > > > > > Here they are, with the approach that is better
> > > > > > > > > > > (roughly speaking) in parentheses:
> > > > > > > > > > > * Detection latency (centralized).
> > > > > > > > > > > * Messaging overhead (centralized).
> > > > > > > > > > > * Failover (distributed).
> > > > > > > > > > > And also, having a park of deadlock detectors sounds
> > > > > > > > > > > not very good. I hope that it is possible to develop a
> > > > > > > > > > > common solution suitable for both kinds of
> > > > > > > > > > > transactions. I suggest piloting the new solution with
> > > > > > > > > > > MVCC and then adapting it for classic transactions.
> > > > > > > > > > >
> > > > > > > > > > > PROPOSAL
> > > > > > > > > > > Actually, I propose to start with the centralized
> > > > > > > > > > > algorithm described by Vladimir at the beginning of the
> > > > > > > > > > > thread. I will try to outline its main points:
> > > > > > > > > > > 1. A single deadlock detector exists in the cluster,
> > > > > > > > > > > which maintains the transaction wait-for graph (WFG).
> > > > > > > > > > > 2. Each cluster node sends and invalidates wait-for
> > > > > > > > > > > edges to the detector.
> > > > > > > > > > > 3. The detector periodically searches for cycles in the
> > > > > > > > > > > WFG and chooses and aborts a victim transaction if a
> > > > > > > > > > > cycle is found (a cycle-search sketch follows this
> > > > > > > > > > > list).
> > > > > > > > > > >
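> > > > > > > > > > > A tiny sketch of step 3's cycle search over the WFG
> > > > > > > > > > > (illustrative only):
> > > > > > > > > > >
> > > > > > > > > > > import java.util.*;
> > > > > > > > > > >
> > > > > > > > > > > // tx id -> ids of txs it waits for
> > > > > > > > > > > Map<UUID, Set<UUID>> waitsFor = new HashMap<>();
> > > > > > > > > > >
> > > > > > > > > > > boolean hasCycleFrom(UUID tx, Set<UUID> path) {
> > > > > > > > > > >     if (!path.add(tx))
> > > > > > > > > > >         return true; // tx revisited on this path: cycle
> > > > > > > > > > >     for (UUID next : waitsFor.getOrDefault(tx, Set.of()))
> > > > > > > > > > >         if (hasCycleFrom(next, path))
> > > > > > > > > > >             return true;
> > > > > > > > > > >     path.remove(tx);
> > > > > > > > > > >     return false;
> > > > > > > > > > > }
> > > > > > > > > > >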
> > > > > > > > > > > Currently I have one fundamental question: is there a
> > > > > > > > > > > possibility of falsely detected deadlocks because of
> > > > > > > > > > > concurrent WFG updates?
> > > > > > > > > > > Of course there are many points of improvement and
> > > > > > > > > > > optimization, but I would like to start by discussing
> > > > > > > > > > > the key points.
> > > > > > > > > > >
> > > > > > > > > > > Please share your thoughts!
> > > > > > > > > > >
> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > > > > > > > > > > Wed, 14 Nov 2018 at 15:47, ipavlukhin <[hidden email]>:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Igniters,
> > > > > > > > > > > >
> > > > > > > > > > > > I would like to resume the discussion about a
> > > > > > > > > > > > deadlock detector. I start with a motivation for
> > > > > > > > > > > > further work on the subject. As I see, the current
> > > > > > > > > > > > implementation (entry point
> > > > > > > > > > > > IgniteTxManager.detectDeadlock) starts a detection
> > > > > > > > > > > > only after a transaction has timed out. In my mind
> > > > > > > > > > > > this is not very good from a product usability
> > > > > > > > > > > > standpoint. As you know, in a deadlock situation some
> > > > > > > > > > > > keys become non-usable for an infinite amount of
> > > > > > > > > > > > time. Currently the only way to work around it is
> > > > > > > > > > > > configuring a timeout, but it could be rather tricky
> > > > > > > > > > > > in practice to choose a proper/universal value for
> > > > > > > > > > > > it. So, I see the main point as:
> > > > > > > > > > > >
> > > > > > > > > > > > Ability to break deadlocks without a need to
> > > > > > > > > > > > configure timeouts explicitly.
> > > > > > > > > > > >
> > > > > > > > > > > > I will return soon with some thoughts about the
> > > > > > > > > > > > implementation. Meanwhile, does anybody have in mind
> > > > > > > > > > > > any other usability points which I am missing? Or
> > > > > > > > > > > > are there any alternative approaches?
> > > > > > > > > > > >
> > > > > > > > > > > > On 2017/11/21 08:32:02, Dmitriy Setrakyan <[hidden email]> wrote:
> > > > > > > > > > > >  > On Mon, Nov 20, 2017 at 10:15 PM, Vladimir Ozerov <[hidden email]> wrote:
> > > > > > > > > > > >  > > It doesn’t need all txes. Instead, other nodes will
> > > > > > > > > > > >  > > send info about suspicious txes to it from time to time.
> > > > > > > > > > > >  >
> > > > > > > > > > > >  > I see your point, I think it might work.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Ivan Pavlukhin
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > > Ivan Pavlukhin
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Ivan Pavlukhin
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ivan Pavlukhin
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Ivan Pavlukhin
> > > >
> >
> >
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion to improve deadlock detection

Ivan Pavlukhin
Igor, thanks for your feedback! Let's do it according to your proposal.

Mon, 24 Dec 2018 at 14:57, Seliverstov Igor <[hidden email]>:

>
> Ivan,
>
> We have to design configuration carefully.
> There are possible compatibility issues, and since this API is public it
> might be difficult to redesign it after a while.
>
> Since there is a ticket about MVCC-related configuration [1] (TxLog
> memory region/size, vacuum thread count and intervals, etc.),
> I suggest solving the issue in the scope of that ticket.
>
> For now, let's introduce a couple of system properties so that the
> behavior can be changed during the development phase.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7966
>
> Regards,
> Igor



--
Best regards,
Ivan Pavlukhin