Hello!
I tried to summarize some ideas on how to make ATOMIC caches stay consistent in case of topology changes.

Imagine a partitioned ATOMIC cache with 2 backups configured (primary + 2 backup copies of the partition). One possible scenario leading to divergence between the primary and its backups is the following: the update-initiating node sends the update operation to the primary node, the primary node propagates the update to 1 of its 2 backups and then dies. If the initiating node crashes as well (and therefore cannot retry the operation), then in the current implementation the 2 copies of the partition remaining in the cluster may differ. Note that both situations are possible: the new primary contains the latest update and the backup does not, and vice versa. A new backup will be elected according to the configured affinity and will rebalance the partition from a random owner, but the copies may still be inconsistent for the reason described above.

This problem does not affect TRANSACTIONAL caches, as the 2PC protocol handles scenarios of this kind very well.

Here is the link to the IEP:
https://cwiki.apache.org/confluence/display/IGNITE/IEP-12+Make+ATOMIC+Caches+Consistent+Again

Sam, Alex G, Vladimir, please share your thoughts.

--Yakov
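[A minimal sketch of the fan-out Yakov describes, with hypothetical names rather than the actual Ignite internals: the primary applies the update and then propagates it to each backup with no prepare/commit round, so a crash after the first send leaves one backup updated and the other stale.]

    import java.util.List;

    class AtomicUpdateSketch {
        /** Fire-and-forget transport to a backup node (hypothetical). */
        interface Node {
            void send(UpdateRequest req);
        }

        /** A single ATOMIC update addressed to one key. */
        record UpdateRequest(Object key, Object val) {}

        static void processOnPrimary(UpdateRequest req, List<Node> backups) {
            applyLocally(req);            // primary applies the update first

            for (Node backup : backups) { // then propagates it to every backup
                backup.send(req);         // if the primary dies after the first send,
                                          // only one backup has the update; with the
                                          // initiating node also gone, nobody retries
            }
        }

        static void applyLocally(UpdateRequest req) {
            // update the local partition store (omitted)
        }
    }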
Yakov,
As one of the prevention mechanisms for the scenario you describe, would it be possible to send the update message to the backup nodes in the order in which they would become primary nodes? For example, if the backup1 node is next in the chain to become the primary node, then the current primary node should send the update message to backup1 before it sends it to backup2.

In this case, if the primary and client nodes fail, then the backup1 node will become primary and will have the latest version of the data, no? Will this solve the situation you describe, at least partially?

D.
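[A hypothetical sketch of the ordering Dmitriy suggests, with made-up names: the primary sorts its backups by the order in which they would take over as primary and sends to them in that order. Note this only orders the sends; it does not by itself guarantee the order in which the messages are received or applied.]

    import java.util.Comparator;
    import java.util.List;

    class OrderedBackupFanout {
        interface Node {
            /** Position in the affinity order: 0 = next primary candidate, 1 = after that, ... */
            int affinityOrder();

            void send(Object update);
        }

        static void sendToBackups(Object update, List<Node> backups) {
            backups.stream()
                .sorted(Comparator.comparingInt(Node::affinityOrder))
                .forEach(b -> b.send(update)); // backup1 (the next primary) is addressed first
        }
    }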
Dmitry,
This will not solve the problem:

1. New primary election may be affinity function dependent.
2. If you want to introduce order in updating backup copies, then you lose parallelism. This will increase ops latency.

My suggestion should not affect current performance on stable or unstable topologies.

Btw, I have just got an idea - if we need to rebalance a partition to more than 1 node, then we need to scan it only once. I believe now we do it each time from the very beginning.

Yakov Zhdanov,
www.gridgain.com
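[A rough sketch of the single-scan rebalancing idea, with hypothetical names rather than the actual rebalancing code: when several nodes demand the same partition, iterate it once and feed every demander from the same pass instead of re-scanning per demander.]

    import java.util.List;

    class SingleScanRebalance {
        /** A node that has demanded this partition (hypothetical). */
        interface Demander {
            void supply(Object entry);
        }

        static void rebalance(Iterable<Object> partitionEntries, List<Demander> demanders) {
            for (Object entry : partitionEntries) {  // single pass over the partition
                for (Demander d : demanders)         // fan each entry out to all demanders
                    d.supply(entry);
            }
        }
    }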
On Thu, Dec 28, 2017 at 9:12 AM, Yakov Zhdanov <[hidden email]> wrote:

> Dmitry,
>
> This will not solve the problem:
>
> 1. New primary election may be affinity function dependent.

Yes, of course.

> 2. If you want to introduce order in updating backup copies, then you lose
> parallelism. This will increase ops latency.

I did not mean waiting until an update is finished. I suggested waiting until an ack for the message is received. In my view, this will significantly reduce the possibility of out-of-sync data.

> My suggestion should not affect current performance on stable or unstable
> topologies.
>
> Btw, I have just got an idea - if we need to rebalance a partition to more
> than 1 node, then we need to scan it only once. I believe now we do it each
> time from the very beginning.

I am not sure what this really means, but you should file a ticket for it with a proper description.

D.
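[A hypothetical sketch of the clarified proposal, using a made-up API rather than Ignite code: wait only for a receipt acknowledgement from the backup that is next in line to become primary, then fan out to the remaining backups without waiting for the update to be applied anywhere.]

    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    class AckGatedFanout {
        interface Node {
            /** Completes when the backup acknowledges receipt of the message, not when it applies the update. */
            CompletableFuture<Void> sendWithReceiptAck(Object update);

            void send(Object update);
        }

        static void update(Object update, Node nextPrimary, List<Node> otherBackups) {
            nextPrimary.sendWithReceiptAck(update)
                .thenRun(() -> otherBackups.forEach(b -> b.send(update))); // remaining backups in parallel after the ack
        }
    }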