2 phase waiting for partitions release

2 phase waiting for partitions release

Pavel Kovalenko
Hello Igniters,

The current implementation of the
GridDhtPartitionsExchangeFuture#waitPartitionRelease method doesn't give
us a 100% guarantee that, after it completes, there are no ongoing atomic
or transactional updates on the current node during the main stage of PME.
It only guarantees that all primary updates will have finished on that
node, while backup updates can still be received and processed after this
method returns.
An example of such a case is described in
https://issues.apache.org/jira/browse/IGNITE-7871
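
To make the gap concrete, here is a toy, self-contained model of the current
single-phase behaviour. It uses plain Java futures rather than real Ignite
internals, and all names are illustrative:

import java.util.concurrent.CompletableFuture;

/** Toy model only: not Ignite code, all names are illustrative. */
public class SinglePhaseWaitSketch {
    public static void main(String[] args) {
        // Futures for updates where this node is the primary.
        CompletableFuture<Void> localPrimaryUpdates =
            CompletableFuture.completedFuture(null);

        // "waitPartitionRelease" analogue: only local primary updates are awaited.
        localPrimaryUpdates.join();

        // Nothing here prevents a backup update, originated by a remote primary
        // before the exchange started, from arriving and being applied later.
        System.out.println("primaries released; a late backup update can still arrive");
    }
}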

To avoid such situations we would like to implement a second phase of the
waitPartitionRelease method.
In this phase every server node participating in PME waits until all other
server nodes have finished their ongoing updates.

Here is a brief description of the algorithm (a toy sketch of the handshake
follows the two lists):

Non-coordinator node:
1) Finish all ongoing atomic & transactional updates.
2) Send acknowledgement to coordinator.
3) Wait for the final acknowledgement from the coordinator that all nodes have
finished their updates.
4) Continue PME.

Coordinator node:
1) Finish all ongoing atomic & transactional updates.
2) Wait for acknowledgements from all server nodes.
3) Send final acknowledgement to all server nodes.
4) Continue PME.
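
A runnable toy sketch of this handshake, using plain java.util.concurrent
primitives; the class and variable names (acksToCoordinator, finalAck, etc.)
are illustrative and not Ignite internals:

import java.util.concurrent.CountDownLatch;

/** Toy model of the proposed two-phase wait; not real Ignite code. */
public class TwoPhasePartitionReleaseSketch {
    public static void main(String[] args) throws Exception {
        int servers = 3;

        // One acknowledgement per server node, collected by the coordinator.
        CountDownLatch acksToCoordinator = new CountDownLatch(servers);
        // The coordinator's final acknowledgement to everyone.
        CountDownLatch finalAck = new CountDownLatch(1);

        for (int i = 0; i < servers; i++) {
            int nodeId = i;
            new Thread(() -> {
                // 1) Finish all ongoing atomic & transactional updates (simulated).
                System.out.println("node-" + nodeId + ": local updates finished");

                // 2) Send acknowledgement to the coordinator.
                acksToCoordinator.countDown();

                // 3) Wait for the final acknowledgement from the coordinator.
                try {
                    finalAck.await();
                }
                catch (InterruptedException ignored) {
                    return;
                }

                // 4) Continue PME.
                System.out.println("node-" + nodeId + ": continuing PME");
            }).start();
        }

        // Coordinator: wait for acks from all server nodes, then broadcast the final ack.
        acksToCoordinator.await();
        System.out.println("coordinator: all servers acked, sending final ack");
        finalAck.countDown();
    }
}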

Acknowledgement messages are tiny, so the additional network pressure and the
overall performance drop should be minimal.

Another solution to the problem is simply cancelling atomic and transactional
backup updates at the PREPARED phase if the topology version has changed.
But from the user's perspective it's not correct to get transaction errors
even in cases where a node is merely joining the cluster.
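
For comparison, a minimal sketch of what this rejected alternative would look
like; the names (onBackupPrepared, txTopVer, curTopVer) are purely
illustrative and not real Ignite API:

/** Toy model of the alternative: fail a backup update at PREPARED on a topology change. */
public class CancelOnPreparedSketch {
    static void onBackupPrepared(long txTopVer, long curTopVer) {
        if (txTopVer != curTopVer)
            // Surfaces to the user as a transaction error, even though the only
            // event may have been a benign topology change such as a node joining.
            throw new IllegalStateException(
                "Topology changed during prepare: " + txTopVer + " -> " + curTopVer);

        // Otherwise apply the backup update as usual.
    }

    public static void main(String[] args) {
        onBackupPrepared(5, 6); // throws: exactly the user-visible error we want to avoid
    }
}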

Any thoughts?

Re: 2 phase waiting for partitions release

Dmitriy Pavlov
Hi Igniters,

I prefer option 1 because throwing any exceptions is bad for product
usability. I think we should go that way only if it is unavoidable.

At the same time it would be good if we could provide a solution that is
just as reliable but optimized in terms of the number of messages.

Please share your vision.

Sincerely,
Dmitriy Pavlov


Re: 2 phase waiting for partitions release

Alexey Goncharuk
For now, I think the two-phase wait is the only option. After the fix is
prototyped we need to benchmark it and check what impact this change has on
PME timing.
