Igniters and especially Ivan Rakov,

"Idle verify" [1] is a really cool tool for making sure that a cluster is consistent.

1) But it requires operations to be paused during the cluster check.
On some clusters this check takes hours (3-4 hours in the cases I saw).
I've checked the code of "idle verify" and it seems possible to make it "online" with some assumptions.

Idea:
Currently "Idle verify" checks that partition hashes, generated this way

    while (it.hasNextX()) {
        CacheDataRow row = it.nextX();

        partHash += row.key().hashCode();
        partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
    }

, are the same.

What if we generate the same updateCounter-partitionHash pairs but compare hashes only when the counters are the same?
So, for example, we ask the cluster to generate pairs for 64 partitions, then find that 55 have the same counters (they were not updated during the check) and check them.
The remaining (64-55 = 9) partitions will be re-requested and rechecked together with an additional 55.
This way we'll be able to check that the cluster is consistent even while operations are in progress (just retrying the modified partitions).

Risks and assumptions:
With this strategy we'll check the cluster's consistency ... eventually, and the check will take more time even on an idle cluster.
If operationsPerTimeToGeneratePartitionHashes > partitionsCount, we'll definitely make no progress.
But if the load is not high, we'll be able to check the whole cluster.

Another hope is that we'll be able to pause/continue the scan: for example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days we'll have checked the whole cluster.

Have I missed something?

2) Since "Idle verify" uses regular pagmem, I assume it replaces hot data with persisted data.
So we have to warm up the cluster after each check.
Is there any chance to run the check without cooling the cluster?

[1] https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
Hi Anton,
Thanks for sharing your ideas.
I think your approach should work in general. I'll just share my concerns about possible issues that may come up.

1) Equality of update counters doesn't imply equality of partition contents under load.
For every update, the primary node generates an update counter, and then the update is delivered to the backup node and applied with the corresponding update counter. For example, there are two transactions (A and B) that update partition X in the following scenario:
- A updates key1 in partition X on the primary node and increments the counter to 10
- B updates key2 in partition X on the primary node and increments the counter to 11
- While A is still updating other keys, B is finally committed
- The update of key2 arrives at the backup node and sets the update counter to 11
An observer will see equal update counters (11), but the update of key1 is still missing in the backup partition.
This is a fundamental problem which is being solved here: https://issues.apache.org/jira/browse/IGNITE-10078
"Online verify" should operate with the new complex update counters, which take such "update holes" into account. Otherwise, online verify may produce false-positive inconsistency reports.

2) Acquisition and comparison of update counters is fast, but partition hash calculation takes a long time. We should check that the update counter remains unchanged after every K keys handled.

3)
> Another hope is that we'll be able to pause/continue scan, for
> example, we'll check 1/3 partitions today, 1/3 tomorrow, and in three
> days we'll check the whole cluster.
Totally makes sense.
We may find ourselves in a situation where some "hot" partitions are still unprocessed, and every next attempt to calculate the partition hash fails due to another concurrent update. We should be able to track the progress of validation (the % of calculation time wasted due to concurrent operations may be a good metric, 100% being the worst case) and provide an option to stop/pause the activity.
I think pause should return an "intermediate results report" with information about which partitions have been successfully checked. With such a report, we can resume the activity later: partitions from the report will simply be skipped.

4)
> Since "Idle verify" uses regular pagmem, I assume it replaces hot data
> with persisted.
> So, we have to warm up the cluster after each check.
> Are there any chances to check without cooling the cluster?
I don't see an easy way to achieve it with our page memory architecture. We definitely can't just read pages from disk directly: we need to synchronize page access with concurrent update operations and checkpoints.
From my point of view, the correct way to solve this issue is to improve our page replacement [1] mechanics by making it truly scan-resistant.

P. S. There's another possible way of achieving online verify: instead of on-demand hash calculation, we can always keep an up-to-date hash value for every partition. We'll need to update the hash on every insert/update/remove operation, but there will be no reordering issues, since the function that we use for aggregating hash results (+) is commutative. With a pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME. What do you think?

[1] - https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)

Best Regards,
Ivan Rakov

On 29.04.2019 12:20, Anton Vinogradov wrote:
> [...]
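Ivan's P. S. about keeping an always up-to-date partition hash relies on integer addition being commutative, so the order in which nodes apply the same set of updates does not change the aggregate. A minimal sketch, assuming hypothetical names (`IncrementalPartitionHash`, `onInsert`, `onUpdate`, `onRemove` are not Ignite APIs) and reusing the per-row hash from the idle verify loop:

```java
import java.util.Arrays;

/**
 * Hypothetical sketch of a partition hash maintained on every insert/update/remove.
 * Not thread-safe; a real implementation would need synchronization or atomics.
 */
class IncrementalPartitionHash {
    private int partHash;

    /** Per-row contribution, matching the additive scheme of the idle verify loop. */
    static int rowHash(Object key, byte[] valueBytes) {
        return key.hashCode() + Arrays.hashCode(valueBytes);
    }

    void onInsert(Object key, byte[] val) {
        partHash += rowHash(key, val);
    }

    /** Subtraction undoes the earlier addition even if the int sum has overflowed. */
    void onRemove(Object key, byte[] oldVal) {
        partHash -= rowHash(key, oldVal);
    }

    void onUpdate(Object key, byte[] oldVal, byte[] newVal) {
        onRemove(key, oldVal);
        onInsert(key, newVal);
    }

    int value() {
        return partHash;
    }
}
```

One consequence visible in the sketch is that updates and removes need the previous value bytes at hand in order to subtract the old contribution; whether that is cheap in the real update path is not established in this thread.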
Ivan, thanks for the analysis!
>> With having pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME.
Great idea, it seems this covers all broken sync cases.

It will check the alive nodes immediately if the primary fails, and will check a rejoining node once it has finished rebalancing (PME on becoming an owner).
A recovered cluster will be checked on the activation PME (or even before that?).
Also, a warmed cluster will still be warm after the check.

Have I missed any cases that lead to broken sync, apart from bugs?

1) But how to keep this hash?
- It should be automatically persisted on each checkpoint (it should not require recalculation on restore, snapshots should be covered too) (and covered by WAL?).
- It should always be available in RAM for every partition (even for cold partitions never updated/read on this node) so that it can be used immediately once all operations are done on PME.

Can we have special pages to keep such hashes and never allow their eviction?

2) PME is a rare operation on a production cluster, but it seems we have to check consistency on a regular basis.
Since we have to finish all operations before the check, should we have a fake PME for the maintenance check in this case?

On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <[hidden email]> wrote:
> [...]
> But how to keep this hash?
I think we can just adopt the way partition update counters are stored.
Update counters are:
1) Kept and updated in heap, see IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during regular cache operations, no page replacement latency issues)
2) Synchronized with page memory (and with disk) on every checkpoint, see GridCacheOffheapManager#saveStoreMetadata
3) Stored in the partition meta page, see PagePartitionMetaIO#setUpdateCounter
4) On node restart, we init the on-heap counter with the value from disk (as of the last checkpoint) and update it to the latest value during replay of WAL logical records

> 2) PME is a rare operation on production cluster, but, seems, we have
> to check consistency in a regular way.
> Since we have to finish all operations before the check, should we
> have fake PME for maintenance check in this case?
From my experience, PME happens on prod clusters from time to time (several times per week), which can be enough. If consistency needs to be checked more often than regular PMEs occur, we can implement a command that will trigger a fake PME for consistency checking.

Best Regards,
Ivan Rakov

On 29.04.2019 18:53, Anton Vinogradov wrote:
> [...]
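The four-step lifecycle Ivan describes for update counters (on-heap value, checkpoint sync, meta page, restore plus WAL replay) could be applied to a partition hash along these lines. This is a toy model under loud assumptions: `PersistedPartitionHashSketch` and all of its members are hypothetical, the meta page is a plain field, and the WAL is a list of hash deltas logged since the last checkpoint. It does not use or mirror the real Ignite classes named in the message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model of a partition hash stored the way update counters are. */
class PersistedPartitionHashSketch {
    /** 1) On-heap value, touched by regular cache operations. */
    private int heapHash;

    /** 3) Value in the partition meta page (modelled as a field). */
    private int metaPageHash;

    /** Logical WAL records (hash deltas) written since the last checkpoint. */
    private final List<Integer> walSinceCheckpoint = new ArrayList<>();

    /** Regular update: log the delta, then apply it to the on-heap value. */
    void applyUpdate(int hashDelta) {
        walSinceCheckpoint.add(hashDelta);
        heapHash += hashDelta;
    }

    /** 2) Checkpoint: synchronize the on-heap value with the meta page. */
    void checkpoint() {
        metaPageHash = heapHash;
        walSinceCheckpoint.clear();
    }

    /** 4) Restart: heap state is lost, so restore from the meta page and replay the WAL. */
    void restart() {
        heapHash = metaPageHash;

        for (int delta : walSinceCheckpoint)
            heapHash += delta;
    }

    int value() {
        return heapHash;
    }
}
```

In this model the hash never has to be recalculated from partition data after a restart, which is the property asked for in the question about persisting the hash on each checkpoint.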
Ivan,
Thanks for the detailed explanation.
I'll try to implement the PoC to check the idea.

On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <[hidden email]> wrote:
> [...]
Ivan, just to make sure ...
The discussed approach will fully solve the issue [1] if we also add a strategy to reject partitions with missed updates (updateCnt==Ok, Hash!=Ok). For example, we may use a quorum strategy, where the majority wins. Does that sound correct?

[1] https://issues.apache.org/jira/browse/IGNITE-10078 |
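The "majority wins" idea from the message above could be sketched roughly as follows. This is only an illustration under assumptions: `PartitionHashQuorum` and `majorityHash` are hypothetical names, not Ignite API, and the input is assumed to be the hashes reported by owners whose update counters already match. Note that an even split between two hashes produces no winner.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not Ignite code): picks the partition hash held by a strict majority of owners. */
final class PartitionHashQuorum {
    private PartitionHashQuorum() {
    }

    /**
     * @param ownerHashes Partition hash reported by each owner of the partition.
     * @return The winning hash, or empty if no hash is held by more than half of the owners.
     */
    static Optional<Integer> majorityHash(List<Integer> ownerHashes) {
        // Count how many owners report each hash value.
        Map<Integer, Integer> votes = new HashMap<>();

        for (Integer hash : ownerHashes)
            votes.merge(hash, 1, Integer::sum);

        // A strict majority is required; at most one hash can satisfy this.
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() * 2 > ownerHashes.size())
                return Optional.of(e.getKey());
        }

        return Optional.empty();
    }
}
```

With three owners reporting hashes 7, 7 and 9, the sketch selects 7; with four owners reporting 7, 7, 9 and 9 it returns an empty result, so some other mechanism would have to decide.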
Anton,
Automatic quorum-based partition drop may work as a partial workaround for IGNITE-10078, but the discussed approach surely doesn't replace the IGNITE-10078 activity. We still don't know what to do when a quorum can't be reached (2 partitions have hash X, 2 have hash Y), and keeping extended update counters is the only way to resolve such a case.
On the other hand, validation of precalculated partition hashes on PME can be a good addition to the IGNITE-10078 logic: we'll be able to detect situations when extended update counters are equal, but for some reason (a bug or whatever) partition contents are different.

Best Regards,
Ivan Rakov |
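A precalculated partition hash used as an extra check next to update counters might look like the sketch below. It is an assumption-laden illustration, not Ignite code: `RunningPartitionHash`, `onUpdate` and `silentlyDiverged` are hypothetical names. The per-entry contribution mirrors the idle verify loop quoted in this thread (key hash plus hash of the value bytes), and because the aggregate is a plain sum, the order in which updates are applied does not matter.

```java
import java.util.Arrays;

/** Hypothetical sketch (not Ignite code): a partition hash kept up to date on every write. */
final class RunningPartitionHash {
    /** Sum of per-entry hashes; int overflow simply wraps, which keeps addition commutative. */
    private int partHash;

    /** Same per-entry contribution as in the idle verify loop: key hash plus hash of the value bytes. */
    static int entryHash(Object key, byte[] valBytes) {
        return key.hashCode() + Arrays.hashCode(valBytes);
    }

    /** Call on insert (oldVal == null), update (both non-null) or remove (newVal == null). */
    void onUpdate(Object key, byte[] oldVal, byte[] newVal) {
        if (oldVal != null)
            partHash -= entryHash(key, oldVal);

        if (newVal != null)
            partHash += entryHash(key, newVal);
    }

    int value() {
        return partHash;
    }

    /** True if two copies report equal update counters but different content hashes. */
    static boolean silentlyDiverged(long cntr1, int hash1, long cntr2, int hash2) {
        return cntr1 == cntr2 && hash1 != hash2;
    }
}
```

Two copies that apply the same updates in a different order end up with the same value, while a copy that lost an update would most likely report a different hash even though its counter matches.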
Ivan,
1) I've checked the PR [1] and it looks like it does not solve the issue either. AFAICS, the main goal of the PR is to produce PartitionUpdateCounter#sequential, which can be false for all backups. Which backup should win in that case?

Is there an IEP or another design page for this fix?

It looks like extended counters should be able to recover the whole cluster even if all copies of the same partition are broken. So it seems the counter should provide detailed info:
- biggest applied updateCounter
- list of all missed counters before the biggest applied one
- optional hash

In that case, we'll be able to perform some exchange between broken copies. For example, we'll find that copy1 missed key1 and copy2 missed key2. It's pretty simple to fix both copies in that case. If all misses can be resolved this way, we'll continue cluster activation as if it had not been broken.

2) I think I see a simpler solution for handling misses than the one in the PR. Once you have newUpdateCounter > curUpdateCounter + 1, you should add a byte (or int or long, the smallest possible) value to a special structure. This value will represent the delta between newUpdateCounter and curUpdateCounter as a bitmask. When you handle an updateCounter less than curUpdateCounter, you should update the value in the structure responsible for this delta. For example, when you have the delta "2 to 6", you will have 00000000 initially and 00011111 finally. Each delta update should finish with a check that the delta is complete (value == 31 in this case). Once it is complete, it should be removed from the structure. Deltas can and should be reused to solve the GC issue.

What do you think about the proposed solution?

3) Hash computation can be an additional extension for extended counters, just one more dimension to be extremely sure everything is ok. Any objections?

[1] https://github.com/apache/ignite/pull/5765 |
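The bitmask idea from point 2 of the message above could be sketched as follows. All names here (`UpdateCounterGaps`, `onUpdate`, `sequential`, `highestApplied`) are hypothetical and do not reflect the PR or the Ignite API. For simplicity the sketch always uses a `long` mask, so a single gap is limited to 64 counters, and it does not pool or reuse the gap objects.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the proposal: tracks skipped update counters as per-gap bitmasks. */
final class UpdateCounterGaps {
    /** One gap: counters first..first+len-1 were skipped; bit i is set once counter first+i is applied. */
    private static final class Gap {
        final int len;
        long applied;

        Gap(int len) {
            this.len = len;
        }

        boolean complete() {
            long full = len == 64 ? -1L : (1L << len) - 1; // e.g. len 5 -> 0b11111 == 31
            return applied == full;
        }
    }

    /** Open gaps keyed by their first missing counter. */
    private final TreeMap<Long, Gap> gaps = new TreeMap<>();

    /** Biggest applied update counter. */
    private long cur;

    void onUpdate(long cntr) {
        if (cntr > cur) {
            if (cntr > cur + 1) {
                long len = cntr - cur - 1;

                if (len > 64)
                    throw new IllegalArgumentException("Gap too wide for a single long mask: " + len);

                gaps.put(cur + 1, new Gap((int)len));
            }

            cur = cntr;

            return;
        }

        // A late update: find the gap that starts at or before this counter.
        Map.Entry<Long, Gap> e = gaps.floorEntry(cntr);

        if (e == null)
            return; // Applied earlier, nothing to fill.

        Gap gap = e.getValue();
        long idx = cntr - e.getKey();

        if (idx >= gap.len)
            return; // Falls after this gap, so it was applied earlier.

        gap.applied |= 1L << idx;

        if (gap.complete())
            gaps.remove(e.getKey());
    }

    /** True when no counters are missing below the biggest applied one. */
    boolean sequential() {
        return gaps.isEmpty();
    }

    long highestApplied() {
        return cur;
    }
}
```

Applying counters 1 and then 7 opens the gap "2 to 6" with mask 00000000; once 2, 3, 4, 5 and 6 arrive in any order the mask reaches 00011111 (31) and the gap is dropped.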
Anton,
1) Extended counters will indeed answer the question of whether a partition can be safely restored to a synchronized state on all owners. The only condition is that one of the owners has no missed updates. If not, the partition must be moved to the LOST state, see [1], TxPartitionCounterStateOnePrimaryTwoBackupsFailAll*Test, IgniteSystemProperties#IGNITE_FAIL_NODE_ON_UNRECOVERABLE_PARTITION_INCONSISTENCY. This is a known issue and can happen if all partition owners were unavailable at some point. In such a case we could try to recover consistency using some complex recovery protocol, as you described. Related ticket: [2].

2) A bitset implementation is considered as an option in GG Community Edition. No specific implementation dates at the moment.

3) As for "online" partition verification, I think the best option right now is to do verification partition by partition, using read-only mode per group partition under load. While verification is in progress, all write ops wait rather than being rejected. This is the only 100% reliable way to compare partitions: by touching actual data. All other ways, like a pre-computed hash, are error-prone. There is already a ticket [3] for simplifying grid consistency verification which could be used as a basis for such functionality. As for avoiding cache pollution, we could try reading pages sequentially from disk without lifting them to pagemem and computing some kind of commutative hash. It's safe under the partition write lock.

[1] https://issues.apache.org/jira/browse/IGNITE-11611
[2] https://issues.apache.org/jira/browse/IGNITE-6324
[3] https://issues.apache.org/jira/browse/IGNITE-11256

On Mon, May 6, 2019 at 16:12, Anton Vinogradov <[hidden email]> wrote: > Ivan, > > 1) I've checked the PR [1] and it looks like it does not solve the issue > too. > AFAICS, the main goal here (at PR) is to produce > PartitionUpdateCounter#sequential which can be false for all backups, what > backup should win in that case? > > Is there any IEP or some another design page for this fix?
> > Looks like extended counters should be able to recover the whole cluster > even in case all copies of the same partition are broken. > So, seems, the counter should provide detailed info: > - biggest applied updateCounter > - list of all missed counters before biggest applied > - optional hash > > In that case, we'll be able to perform some exchange between broken copies. > For example, we'll found that copy1 missed key1, and copy2 missed key2. > It's pretty simple to fix both copies in that case. > In case all misses can be solved this way, we'll continue cluster > activation like it was not broken before. > > 2) Seems I see the simpler solution to handle misses (than at PR). > Once you have newUpdateCounter > curUpdateCounter + 1, you should add byte > (or int or long (smaplest possible)) value to special structure. > This value will represent delta between newUpdateCounter and > curUpdateCounter in bitmask way. > In case you'll handle updateCounter less that curUpdateCounter, you should > update the value at structure responsible to this delta. > For example, when you have delta "2 to 6", you will have 00000000 initially > and 00011111 finally. > Each delta update should be finished with check it completed (value == 31 > in this case). Once it finished, it should be removed from the structure. > Deltas can and should be reused to solve GC issue. > > What do you think about the proposed solution? > > 3) Hash computation can be an additional extension for extended counters, > just one more dimension to be extremely sure everything is ok. > Any objections? > > [1] https://github.com/apache/ignite/pull/5765 > > On Mon, May 6, 2019 at 12:48 PM Ivan Rakov <[hidden email]> wrote: > > > Anton, > > > > Automatic quorum-based partition drop may work as a partial workaround > > for IGNITE-10078, but discussed approach surely doesn't replace > > IGNITE-10078 activity. 
We still don't know what do to when quorum can't > > be reached (2 partitions have hash X, 2 have hash Y) and keeping > > extended update counters is the only way to resolve such case. > > On the other hand, precalculated partition hashes validation on PME can > > be a good addition to IGNITE-10078 logic: we'll be able to detect > > situations when extended update counters are equal, but for some reason > > (bug or whatsoever) partition contents are different. > > > > Best Regards, > > Ivan Rakov > > > > On 06.05.2019 12:27, Anton Vinogradov wrote: > > > Ivan, just to make sure ... > > > The discussed case will fully solve the issue [1] in case we'll also > add > > > some strategy to reject partitions with missed updates (updateCnt==Ok, > > > Hash!=Ok). > > > For example, we may use the Quorum strategy, when the majority wins. > > > Sounds correct? > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-10078 > > > > > > On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <[hidden email]> > wrote: > > > > > >> Ivan, > > >> > > >> Thanks for the detailed explanation. > > >> I'll try to implement the PoC to check the idea. > > >> > > >> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <[hidden email]> > > wrote: > > >> > > >>>> But how to keep this hash? > > >>> I think, we can just adopt way of storing partition update counters. 
> > >>> Update counters are: > > >>> 1) Kept and updated in heap, see > > >>> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed > during > > >>> regular cache operations, no page replacement latency issues) > > >>> 2) Synchronized with page memory (and with disk) on every checkpoint, > > >>> see GridCacheOffheapManager#saveStoreMetadata > > >>> 3) Stored in partition meta page, see > > PagePartitionMetaIO#setUpdateCounter > > >>> 4) On node restart, we init onheap counter with value from disk (for > > the > > >>> moment of last checkpoint) and update it to latest value during WAL > > >>> logical records replay > > >>> > > >>>> 2) PME is a rare operation on production cluster, but, seems, we > have > > >>>> to check consistency in a regular way. > > >>>> Since we have to finish all operations before the check, should we > > >>>> have fake PME for maintenance check in this case? > > >>> From my experience, PME happens on prod clusters from time to time > > >>> (several times per week), which can be enough. In case it's needed to > > >>> check consistency more often than regular PMEs occur, we can > implement > > >>> command that will trigger fake PME for consistency checking. > > >>> > > >>> Best Regards, > > >>> Ivan Rakov > > >>> > > >>> On 29.04.2019 18:53, Anton Vinogradov wrote: > > >>>> Ivan, thanks for the analysis! > > >>>> > > >>>>>> With having pre-calculated partition hash value, we can > > >>>> automatically detect inconsistent partitions on every PME. > > >>>> Great idea, seems this covers all broken synс cases. > > >>>> > > >>>> It will check alive nodes in case the primary failed immediately > > >>>> and will check rejoining node once it finished a rebalance (PME on > > >>>> becoming an owner). > > >>>> Recovered cluster will be checked on activation PME (or even before > > >>>> that?). > > >>>> Also, warmed cluster will be still warmed after check. > > >>>> > > >>>> Have I missed some cases leads to broken sync except bugs? 
> > >>>> > > >>>> 1) But how to keep this hash? > > >>>> - It should be automatically persisted on each checkpoint (it should > > >>>> not require recalculation on restore, snapshots should be covered > too) > > >>>> (and covered by WAL?). > > >>>> - It should be always available at RAM for every partition (even for > > >>>> cold partitions never updated/readed on this node) to be immediately > > >>>> used once all operations done on PME. > > >>>> > > >>>> Can we have special pages to keep such hashes and never allow their > > >>>> eviction? > > >>>> > > >>>> 2) PME is a rare operation on production cluster, but, seems, we > have > > >>>> to check consistency in a regular way. > > >>>> Since we have to finish all operations before the check, should we > > >>>> have fake PME for maintenance check in this case? > > >>>> > > >>>> On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <[hidden email] > > >>>> <mailto:[hidden email]>> wrote: > > >>>> > > >>>> Hi Anton, > > >>>> > > >>>> Thanks for sharing your ideas. > > >>>> I think your approach should work in general. I'll just share > my > > >>>> concerns about possible issues that may come up. > > >>>> > > >>>> 1) Equality of update counters doesn't imply equality of > > >>>> partitions content under load. > > >>>> For every update, primary node generates update counter and > then > > >>>> update is delivered to backup node and gets applied with the > > >>>> corresponding update counter. 
For example, there are two > > >>>> transactions (A and B) that update partition X by the following > > >>>> scenario: > > >>>> - A updates key1 in partition X on primary node and increments > > >>>> counter to 10 > > >>>> - B updates key2 in partition X on primary node and increments > > >>>> counter to 11 > > >>>> - While A is still updating another keys, B is finally > committed > > >>>> - Update of key2 arrives to backup node and sets update counter > > to > > >>> 11 > > >>>> Observer will see equal update counters (11), but update of > key 1 > > >>>> is still missing in the backup partition. > > >>>> This is a fundamental problem which is being solved here: > > >>>> https://issues.apache.org/jira/browse/IGNITE-10078 > > >>>> "Online verify" should operate with new complex update counters > > >>>> which take such "update holes" into account. Otherwise, online > > >>>> verify may provide false-positive inconsistency reports. > > >>>> > > >>>> 2) Acquisition and comparison of update counters is fast, but > > >>>> partition hash calculation is long. We should check that update > > >>>> counter remains unchanged after every K keys handled. > > >>>> > > >>>> 3) > > >>>> > > >>>>> Another hope is that we'll be able to pause/continue scan, for > > >>>>> example, we'll check 1/3 partitions today, 1/3 tomorrow, and > in > > >>>>> three days we'll check the whole cluster. > > >>>> Totally makes sense. > > >>>> We may find ourselves into a situation where some "hot" > > partitions > > >>>> are still unprocessed, and every next attempt to calculate > > >>>> partition hash fails due to another concurrent update. We > should > > >>>> be able to track progress of validation (% of calculation time > > >>>> wasted due to concurrent operations may be a good metric, 100% > is > > >>>> the worst case) and provide option to stop/pause activity. 
> > >>>> I think, pause should return an "intermediate results report" > > with > > >>>> information about which partitions have been successfully > > checked. > > >>>> With such report, we can resume activity later: partitions from > > >>>> report will be just skipped. > > >>>> > > >>>> 4) > > >>>> > > >>>>> Since "Idle verify" uses regular pagmem, I assume it replaces > > hot > > >>>>> data with persisted. > > >>>>> So, we have to warm up the cluster after each check. > > >>>>> Are there any chances to check without cooling the cluster? > > >>>> I don't see an easy way to achieve it with our page memory > > >>>> architecture. We definitely can't just read pages from disk > > >>>> directly: we need to synchronize page access with concurrent > > >>>> update operations and checkpoints. > > >>>> From my point of view, the correct way to solve this issue is > > >>>> improving our page replacement [1] mechanics by making it truly > > >>>> scan-resistant. > > >>>> > > >>>> P. S. There's another possible way of achieving online verify: > > >>>> instead of on-demand hash calculation, we can always keep > > >>>> up-to-date hash value for every partition. We'll need to update > > >>>> hash on every insert/update/remove operation, but there will be > > no > > >>>> reordering issues as per function that we use for aggregating > > hash > > >>>> results (+) is commutative. With having pre-calculated > partition > > >>>> hash value, we can automatically detect inconsistent partitions > > on > > >>>> every PME. What do you think? 
> > >>>> > > >>>> [1] - > > >>>> > > >>> > > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk) > > >>>> Best Regards, > > >>>> Ivan Rakov > > >>>> > > >>>> On 29.04.2019 12:20, Anton Vinogradov wrote: > > >>>>> Igniters and especially Ivan Rakov, > > >>>>> > > >>>>> "Idle verify" [1] is a really cool tool, to make sure that > > >>>>> cluster is consistent. > > >>>>> > > >>>>> 1) But it required to have operations paused during cluster > > check. > > >>>>> At some clusters, this check requires hours (3-4 hours at > cases > > I > > >>>>> saw). > > >>>>> I've checked the code of "idle verify" and it seems it > possible > > >>>>> to make it "online" with some assumptions. > > >>>>> > > >>>>> Idea: > > >>>>> Currently "Idle verify" checks that partitions hashes, > generated > > >>>>> this way > > >>>>> while (it.hasNextX()) { > > >>>>> CacheDataRow row = it.nextX(); > > >>>>> partHash += row.key().hashCode(); > > >>>>> partHash += > > >>>>> > > >>> > Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext())); > > >>>>> } > > >>>>> , are the same. > > >>>>> > > >>>>> What if we'll generate same pairs updateCounter-partitionHash > > but > > >>>>> will compare hashes only in case counters are the same? > > >>>>> So, for example, will ask cluster to generate pairs for 64 > > >>>>> partitions, then will find that 55 have the same counters (was > > >>>>> not updated during check) and check them. > > >>>>> The rest (64-55 = 9) partitions will be re-requested and > > >>>>> rechecked with an additional 55. > > >>>>> This way we'll be able to check cluster is consistent even in > > >>>>> сase operations are in progress (just retrying modified). > > >>>>> > > >>>>> Risks and assumptions: > > >>>>> Using this strategy we'll check the cluster's consistency ... > > >>>>> eventually, and the check will take more time even on an idle > > >>>>> cluster. 
> > >>>>> In case operationsPerTimeToGeneratePartitionHashes > > > >>>>> partitionsCount we'll definitely gain no progress. > > >>>>> But, in case of the load is not high, we'll be able to check > all > > >>>>> cluster. > > >>>>> > > >>>>> Another hope is that we'll be able to pause/continue scan, for > > >>>>> example, we'll check 1/3 partitions today, 1/3 tomorrow, and > in > > >>>>> three days we'll check the whole cluster. > > >>>>> > > >>>>> Have I missed something? > > >>>>> > > >>>>> 2) Since "Idle verify" uses regular pagmem, I assume it > replaces > > >>>>> hot data with persisted. > > >>>>> So, we have to warm up the cluster after each check. > > >>>>> Are there any chances to check without cooling the cluster? > > >>>>> > > >>>>> [1] > > >>>>> > > >>> > > > https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums > > > -- Best regards, Alexei Scherbakov |
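A minimal Java sketch of the "commutative hash" idea from point 3 of the message above. It reuses the additive scheme from the idle verify loop quoted at the start of the thread (key hash plus hash of the value bytes, summed per partition) and shows that the result does not depend on the order in which rows are read, which is the property that would allow scanning pages sequentially from disk. The `Row` class and the `partitionHash` method are hypothetical stand-ins for illustration, not Ignite APIs; real code would iterate `CacheDataRow` objects while holding the partition lock.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CommutativePartitionHash {
    /** Hypothetical stand-in for a cache row: a key and its serialized value. */
    static final class Row {
        final String key;
        final byte[] value;

        Row(String key, byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Sums per-row hashes. Wrapping int addition is commutative, so row order is irrelevant. */
    static int partitionHash(Iterable<Row> rows) {
        int partHash = 0;

        for (Row row : rows) {
            partHash += row.key.hashCode();
            partHash += Arrays.hashCode(row.value);
        }

        return partHash;
    }

    private static Row row(String key, String val) {
        return new Row(key, val.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        List<Row> primary = new ArrayList<>(Arrays.asList(row("k1", "v1"), row("k2", "v2"), row("k3", "v3")));

        // The backup holds the same rows but reads them in a different (e.g. on-disk) order.
        List<Row> backup = new ArrayList<>(primary);
        Collections.reverse(backup);

        System.out.println(partitionHash(primary) == partitionHash(backup)); // true

        // One diverged value on the backup (the row for k3) changes the partition hash.
        backup.set(0, row("k3", "stale"));

        System.out.println(partitionHash(primary) == partitionHash(backup)); // false
    }
}
```

A plain sum is cheap and order-independent, but it is not collision-resistant: two different sets of rows can add up to the same value, so a match is strong evidence of equality rather than proof.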
Alexei,
Got it.
Could you please let me know once the PR is ready for review?
I currently have some questions, but they are possibly caused by the PR not being final (e.g., why does the atomic counter still ignore misses?).

On Tue, May 7, 2019 at 4:43 PM Alexei Scherbakov <[hidden email]> wrote:
> Anton,
>
> 1) Extended counters will indeed answer the question of whether a partition can be safely restored to a synchronized state on all owners.
> The only condition is that one of the owners has no missed updates.
> If not, the partition must be moved to the LOST state, see [1], TxPartitionCounterStateOnePrimaryTwoBackupsFailAll*Test, IgniteSystemProperties#IGNITE_FAIL_NODE_ON_UNRECOVERABLE_PARTITION_INCONSISTENCY.
> This is a known issue and can happen if all partition owners were unavailable at some point.
> In such a case we could try to recover consistency using some complex recovery protocol, as you described. Related ticket: [2].
>
> 2) A bitset implementation is being considered as an option in GG Community Edition. There are no specific implementation dates at the moment.
>
> 3) As for "online" partition verification, I think the best option right now is to verify partition by partition, using read-only mode per group partition under load.
> While verification is in progress, all write ops wait; they are not rejected.
> This is the only 100% reliable way to compare partitions: by touching the actual data. All other ways, like a pre-computed hash, are error-prone.
> There is already a ticket [3] for simplifying grid consistency verification, which could be used as a basis for such functionality.
> As for avoiding cache pollution, we could try reading pages sequentially from disk without lifting them to pagemem, computing some kind of commutative hash. It's safe under the partition write lock.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11611
> [2] https://issues.apache.org/jira/browse/IGNITE-6324
> [3] https://issues.apache.org/jira/browse/IGNITE-11256
>
> --
> Best regards,
> Alexei Scherbakov
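The missed-update tracking discussed in this exchange can be illustrated with a small Java sketch. It follows the shape Anton proposed earlier in the thread (biggest applied counter plus the list of missed counters) and replays the two-transaction scenario from Ivan's first reply, where the backup applies update 11 before update 10. `GapTrackingCounter` and its methods are hypothetical names for illustration; this is not the `PartitionUpdateCounter` implementation from the PR, the starting value of 9 is an assumption, and a `TreeSet` is used instead of a bitmask or bitset purely for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class GapTrackingCounter {
    /** Low watermark: this counter and every counter below it have been applied. */
    private long lwm;

    /** Counters above the low watermark that were applied out of order. */
    private final TreeSet<Long> outOfOrder = new TreeSet<>();

    GapTrackingCounter(long initial) {
        lwm = initial;
    }

    /** Records an applied update and closes gaps when the missing counters arrive. */
    void apply(long cntr) {
        if (cntr <= lwm)
            return; // Already covered by the low watermark.

        outOfOrder.add(cntr);

        // Advance the watermark over every counter that is now contiguous.
        while (!outOfOrder.isEmpty() && outOfOrder.first() == lwm + 1)
            lwm = outOfOrder.pollFirst();
    }

    /** Biggest applied counter, which is what a plain update counter would report. */
    long highestApplied() {
        return outOfOrder.isEmpty() ? lwm : outOfOrder.last();
    }

    /** True when there are no "update holes" below the biggest applied counter. */
    boolean sequential() {
        return outOfOrder.isEmpty();
    }

    /** Counters below the biggest applied one that have not been applied yet. */
    List<Long> missed() {
        List<Long> res = new ArrayList<>();

        for (long c = lwm + 1; c < highestApplied(); c++) {
            if (!outOfOrder.contains(c))
                res.add(c);
        }

        return res;
    }

    public static void main(String[] args) {
        GapTrackingCounter primary = new GapTrackingCounter(9);
        primary.apply(10); // Transaction A, key1.
        primary.apply(11); // Transaction B, key2.

        GapTrackingCounter backup = new GapTrackingCounter(9);
        backup.apply(11); // B commits first; A's update has not arrived yet.

        System.out.println(primary.highestApplied() == backup.highestApplied()); // true
        System.out.println(backup.sequential()); // false
        System.out.println(backup.missed()); // [10]

        backup.apply(10); // A's update finally arrives.

        System.out.println(backup.sequential()); // true
    }
}
```

Comparing only `highestApplied()` reproduces the false equality from the scenario, while `sequential()` and `missed()` expose the hole, so a verifier built on such a counter would know to skip or retry the partition instead of reporting a mismatch.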
Anton,
It's ready for review, look for the "Patch Available" status.
Yes, atomic caches are not fixed by this contribution. See [1].

[1] https://issues.apache.org/jira/browse/IGNITE-11797

On Tue, May 7, 2019 at 17:30, Anton Vinogradov <[hidden email]> wrote:
> Alexei,
>
> Got it.
> Could you please let me know once the PR is ready for review?
> I currently have some questions, but they are possibly caused by the PR not being final (e.g., why does the atomic counter still ignore misses?).
> > > > >>> > > > > >>> Best Regards, > > > > >>> Ivan Rakov > > > > >>> > > > > >>> On 29.04.2019 18:53, Anton Vinogradov wrote: > > > > >>>> Ivan, thanks for the analysis! > > > > >>>> > > > > >>>>>> With having pre-calculated partition hash value, we can > > > > >>>> automatically detect inconsistent partitions on every PME. > > > > >>>> Great idea, seems this covers all broken synс cases. > > > > >>>> > > > > >>>> It will check alive nodes in case the primary failed immediately > > > > >>>> and will check rejoining node once it finished a rebalance (PME > on > > > > >>>> becoming an owner). > > > > >>>> Recovered cluster will be checked on activation PME (or even > > before > > > > >>>> that?). > > > > >>>> Also, warmed cluster will be still warmed after check. > > > > >>>> > > > > >>>> Have I missed some cases leads to broken sync except bugs? > > > > >>>> > > > > >>>> 1) But how to keep this hash? > > > > >>>> - It should be automatically persisted on each checkpoint (it > > should > > > > >>>> not require recalculation on restore, snapshots should be > covered > > > too) > > > > >>>> (and covered by WAL?). > > > > >>>> - It should be always available at RAM for every partition (even > > for > > > > >>>> cold partitions never updated/readed on this node) to be > > immediately > > > > >>>> used once all operations done on PME. > > > > >>>> > > > > >>>> Can we have special pages to keep such hashes and never allow > > their > > > > >>>> eviction? > > > > >>>> > > > > >>>> 2) PME is a rare operation on production cluster, but, seems, we > > > have > > > > >>>> to check consistency in a regular way. > > > > >>>> Since we have to finish all operations before the check, should > we > > > > >>>> have fake PME for maintenance check in this case? > > > > >>>> > > > > >>>> On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov < > [hidden email] > > > > >>>> <mailto:[hidden email]>> wrote: > > > > >>>> > > > > >>>> Hi Anton, > > > > >>>> > > > > >>>> Thanks for sharing your ideas. 
> > > > >>>> I think your approach should work in general. I'll just > share > > > my > > > > >>>> concerns about possible issues that may come up. > > > > >>>> > > > > >>>> 1) Equality of update counters doesn't imply equality of > > > > >>>> partitions content under load. > > > > >>>> For every update, primary node generates update counter and > > > then > > > > >>>> update is delivered to backup node and gets applied with > the > > > > >>>> corresponding update counter. For example, there are two > > > > >>>> transactions (A and B) that update partition X by the > > following > > > > >>>> scenario: > > > > >>>> - A updates key1 in partition X on primary node and > > increments > > > > >>>> counter to 10 > > > > >>>> - B updates key2 in partition X on primary node and > > increments > > > > >>>> counter to 11 > > > > >>>> - While A is still updating another keys, B is finally > > > committed > > > > >>>> - Update of key2 arrives to backup node and sets update > > counter > > > > to > > > > >>> 11 > > > > >>>> Observer will see equal update counters (11), but update of > > > key 1 > > > > >>>> is still missing in the backup partition. > > > > >>>> This is a fundamental problem which is being solved here: > > > > >>>> https://issues.apache.org/jira/browse/IGNITE-10078 > > > > >>>> "Online verify" should operate with new complex update > > counters > > > > >>>> which take such "update holes" into account. Otherwise, > > online > > > > >>>> verify may provide false-positive inconsistency reports. > > > > >>>> > > > > >>>> 2) Acquisition and comparison of update counters is fast, > but > > > > >>>> partition hash calculation is long. We should check that > > update > > > > >>>> counter remains unchanged after every K keys handled. 
> > > > >>>> > > > > >>>> 3) > > > > >>>> > > > > >>>>> Another hope is that we'll be able to pause/continue scan, > > for > > > > >>>>> example, we'll check 1/3 partitions today, 1/3 tomorrow, > and > > > in > > > > >>>>> three days we'll check the whole cluster. > > > > >>>> Totally makes sense. > > > > >>>> We may find ourselves into a situation where some "hot" > > > > partitions > > > > >>>> are still unprocessed, and every next attempt to calculate > > > > >>>> partition hash fails due to another concurrent update. We > > > should > > > > >>>> be able to track progress of validation (% of calculation > > time > > > > >>>> wasted due to concurrent operations may be a good metric, > > 100% > > > is > > > > >>>> the worst case) and provide option to stop/pause activity. > > > > >>>> I think, pause should return an "intermediate results > report" > > > > with > > > > >>>> information about which partitions have been successfully > > > > checked. > > > > >>>> With such report, we can resume activity later: partitions > > from > > > > >>>> report will be just skipped. > > > > >>>> > > > > >>>> 4) > > > > >>>> > > > > >>>>> Since "Idle verify" uses regular pagmem, I assume it > > replaces > > > > hot > > > > >>>>> data with persisted. > > > > >>>>> So, we have to warm up the cluster after each check. > > > > >>>>> Are there any chances to check without cooling the > cluster? > > > > >>>> I don't see an easy way to achieve it with our page memory > > > > >>>> architecture. We definitely can't just read pages from disk > > > > >>>> directly: we need to synchronize page access with > concurrent > > > > >>>> update operations and checkpoints. > > > > >>>> From my point of view, the correct way to solve this issue > is > > > > >>>> improving our page replacement [1] mechanics by making it > > truly > > > > >>>> scan-resistant. > > > > >>>> > > > > >>>> P. S. 
There's another possible way of achieving online > > verify: > > > > >>>> instead of on-demand hash calculation, we can always keep > > > > >>>> up-to-date hash value for every partition. We'll need to > > update > > > > >>>> hash on every insert/update/remove operation, but there > will > > be > > > > no > > > > >>>> reordering issues as per function that we use for > aggregating > > > > hash > > > > >>>> results (+) is commutative. With having pre-calculated > > > partition > > > > >>>> hash value, we can automatically detect inconsistent > > partitions > > > > on > > > > >>>> every PME. What do you think? > > > > >>>> > > > > >>>> [1] - > > > > >>>> > > > > >>> > > > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk) > > > > >>>> Best Regards, > > > > >>>> Ivan Rakov > > > > >>>> > > > > >>>> On 29.04.2019 12:20, Anton Vinogradov wrote: > > > > >>>>> Igniters and especially Ivan Rakov, > > > > >>>>> > > > > >>>>> "Idle verify" [1] is a really cool tool, to make sure that > > > > >>>>> cluster is consistent. > > > > >>>>> > > > > >>>>> 1) But it required to have operations paused during > cluster > > > > check. > > > > >>>>> At some clusters, this check requires hours (3-4 hours at > > > cases > > > > I > > > > >>>>> saw). > > > > >>>>> I've checked the code of "idle verify" and it seems it > > > possible > > > > >>>>> to make it "online" with some assumptions. > > > > >>>>> > > > > >>>>> Idea: > > > > >>>>> Currently "Idle verify" checks that partitions hashes, > > > generated > > > > >>>>> this way > > > > >>>>> while (it.hasNextX()) { > > > > >>>>> CacheDataRow row = it.nextX(); > > > > >>>>> partHash += row.key().hashCode(); > > > > >>>>> partHash += > > > > >>>>> > > > > >>> > > > Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext())); > > > > >>>>> } > > > > >>>>> , are the same. 
> > > > >>>>> > > > > >>>>> What if we'll generate same pairs > > updateCounter-partitionHash > > > > but > > > > >>>>> will compare hashes only in case counters are the same? > > > > >>>>> So, for example, will ask cluster to generate pairs for 64 > > > > >>>>> partitions, then will find that 55 have the same counters > > (was > > > > >>>>> not updated during check) and check them. > > > > >>>>> The rest (64-55 = 9) partitions will be re-requested and > > > > >>>>> rechecked with an additional 55. > > > > >>>>> This way we'll be able to check cluster is consistent even > > in > > > > >>>>> сase operations are in progress (just retrying modified). > > > > >>>>> > > > > >>>>> Risks and assumptions: > > > > >>>>> Using this strategy we'll check the cluster's consistency > > ... > > > > >>>>> eventually, and the check will take more time even on an > > idle > > > > >>>>> cluster. > > > > >>>>> In case operationsPerTimeToGeneratePartitionHashes > > > > > >>>>> partitionsCount we'll definitely gain no progress. > > > > >>>>> But, in case of the load is not high, we'll be able to > check > > > all > > > > >>>>> cluster. > > > > >>>>> > > > > >>>>> Another hope is that we'll be able to pause/continue scan, > > for > > > > >>>>> example, we'll check 1/3 partitions today, 1/3 tomorrow, > and > > > in > > > > >>>>> three days we'll check the whole cluster. > > > > >>>>> > > > > >>>>> Have I missed something? > > > > >>>>> > > > > >>>>> 2) Since "Idle verify" uses regular pagmem, I assume it > > > replaces > > > > >>>>> hot data with persisted. > > > > >>>>> So, we have to warm up the cluster after each check. > > > > >>>>> Are there any chances to check without cooling the > cluster? > > > > >>>>> > > > > >>>>> [1] > > > > >>>>> > > > > >>> > > > > > > > > > > https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums > > > > > > > > > > > > > -- > > > > Best regards, > > Alexei Scherbakov > > > -- Best regards, Alexei Scherbakov |
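The "misses" discussed in this thread are update counters that a partition copy has not applied yet although a higher counter has already been applied. A counter that only remembers the highest applied value cannot see them. Below is a minimal Java sketch of a gap-aware counter, assuming 1-based update counters. The class and its methods (GapTrackingCounter, apply, highestApplied, missed, sequential) are hypothetical names chosen for illustration; this is not the code from the PR and not Ignite's PartitionUpdateCounter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical illustration of a gap-aware update counter; not Ignite code. */
class GapTrackingCounter {
    /** Highest counter such that every update up to and including it was applied. */
    private long lowWaterMark;

    /** Counters applied out of order, i.e. above a gap that is still open. */
    private final TreeSet<Long> outOfOrder = new TreeSet<>();

    /** Records that the update with the given counter was applied. */
    void apply(long cntr) {
        if (cntr <= lowWaterMark || !outOfOrder.add(cntr))
            return; // Duplicate delivery, nothing to do.

        // Advance the low water mark across every gap that has just been closed.
        while (!outOfOrder.isEmpty() && outOfOrder.first() == lowWaterMark + 1)
            lowWaterMark = outOfOrder.pollFirst();
    }

    /** What a naive counter would report: the biggest applied counter. */
    long highestApplied() {
        return outOfOrder.isEmpty() ? lowWaterMark : outOfOrder.last();
    }

    /** Counters below the biggest applied one that were never applied. */
    List<Long> missed() {
        List<Long> res = new ArrayList<>();
        long hwm = highestApplied();

        for (long c = lowWaterMark + 1; c < hwm; c++) {
            if (!outOfOrder.contains(c))
                res.add(c);
        }

        return res;
    }

    /** True when there are no gaps, so comparing plain counters is meaningful. */
    boolean sequential() {
        return outOfOrder.isEmpty();
    }

    public static void main(String[] args) {
        // Scenario from the thread: tx A owns counter 10, tx B owns counter 11,
        // and B's update reaches the backup first.
        GapTrackingCounter backup = new GapTrackingCounter();

        for (long c = 1; c <= 9; c++)
            backup.apply(c);

        backup.apply(11);
        System.out.println("highest=" + backup.highestApplied() + ", missed=" + backup.missed());

        backup.apply(10); // A's update finally arrives and closes the gap.
        System.out.println("highest=" + backup.highestApplied() + ", missed=" + backup.missed());
    }
}
```

In this sketch the backup reports the same highest counter as the primary both before and after counter 10 arrives, so only missed() or sequential() distinguishes the two states. A sorted set is used here for readability; a bitmask per gap, as proposed earlier in the thread, would be a more compact alternative.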
I've linked additional related/dependent tickets to IGNITE-10078; they require
investigation as soon as the main contribution is accepted.

--
Best regards,
Alexei Scherbakov |