Hi Igor,
I noticed that the current Cassandra store implementation doesn't support batching for the writeAll and deleteAll methods - it simply executes all updates one by one (asynchronously, in parallel). I think it would be useful to provide such support, so I created a ticket [1]. Can you please give your input on this? Does it make sense in your opinion?

[1] https://issues.apache.org/jira/browse/IGNITE-3588

-Val
Hi Valentin,
For writeAll/readAll, the Cassandra cache store implementation uses async operations (http://www.datastax.com/dev/blog/java-driver-async-queries) and futures, which have the best performance characteristics.

The Cassandra BATCH statement is actually quite often an anti-pattern for those who come from the relational world. The BATCH concept in Cassandra is totally different from the relational one and is not meant for optimizing batch/bulk operations. The main purpose of a Cassandra BATCH is to keep denormalized data in sync - for example, when you duplicate the same data into several tables. All other cases are not recommended for Cassandra batches:
- https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e#.k4xfir8ij
- http://christopher-batey.blogspot.com/2015/02/cassandra-anti-pattern-misuse-of.html
- https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/

It's also worth mentioning that in the CassandraCacheStore implementation (actually in CassandraSessionImpl) every Cassandra operation is wrapped in a retry loop. In case of failure, up to 20 attempts are performed with incrementally increasing timeouts (starting from 100ms) and specific exception-handling logic (Cassandra host unavailability, etc.). Thus it provides a quite reliable persistence mechanism. According to load tests, even on a heavily overloaded Cassandra cluster (CPU load > 10 per core) there were no lost writes/reads/deletes, and at most 6 attempts were needed for any single operation.

Igor Rudyak
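A minimal sketch of the async pattern described above, assuming the DataStax Java driver 3.x API (the keyspace, table, and column names are illustrative, not the module's actual schema):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class AsyncWriteSketch {
    /** Fires one async INSERT per entry, then waits for all futures at once. */
    public static void writeAll(Session session, Map<Long, String> entries) {
        PreparedStatement ps =
            session.prepare("INSERT INTO ks.cache_table (key, val) VALUES (?, ?)");

        List<ResultSetFuture> futures = new ArrayList<>(entries.size());

        for (Map.Entry<Long, String> e : entries.entrySet()) {
            BoundStatement st = ps.bind(e.getKey(), e.getValue());
            futures.add(session.executeAsync(st)); // non-blocking; load spreads across the cluster
        }

        // Block only once, after all mutations are already in flight.
        for (ResultSetFuture f : futures)
            f.getUninterruptibly();
    }
}

And a hypothetical sketch of the retry loop described above (up to 20 attempts, sleep growing from 100ms; the real CassandraSessionImpl logic handles more exception types than this):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

public class RetrySketch {
    private static final int MAX_ATTEMPTS = 20;
    private static final long BASE_SLEEP_MS = 100;

    public static ResultSet execute(Session session, Statement stmt) {
        RuntimeException lastError = null;

        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return session.execute(stmt);
            }
            catch (NoHostAvailableException e) {
                lastError = e; // hosts temporarily unavailable - worth retrying
            }

            try {
                Thread.sleep(BASE_SLEEP_MS * attempt); // incrementally increasing timeout
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }

        throw lastError;
    }
}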
On Tue, Jul 26, 2016 at 5:53 PM, Igor Rudyak <[hidden email]> wrote:
> Hi Valentin,
>
> For writeAll/readAll, the Cassandra cache store implementation uses async
> operations (http://www.datastax.com/dev/blog/java-driver-async-queries)
> and futures, which have the best performance characteristics.

Thanks, Igor. This link describes the query operations, but I could not find any mention of writes.

> It's also worth mentioning that in the CassandraCacheStore implementation
> (actually in CassandraSessionImpl) every Cassandra operation is wrapped
> in a retry loop. [...] Thus it provides a quite reliable persistence
> mechanism.

I think that the main point about Cassandra batch operations is not reliability but performance. If a user batches up 100s of updates in one Cassandra batch, it will be a lot faster than doing them one by one in Ignite. Wrapping them into an Ignite "putAll(...)" call just seems more logical to me, no?
Dmitriy,
The same approach is used for all async read/write/delete operations - the Cassandra session just provides an executeAsync(statement) method for all types of operations.

To be more detailed about Cassandra batches, there are actually two types:

1) *Logged batch* (aka atomic) - the main purpose of such batches is to keep duplicated data in sync while updating multiple tables, at the cost of performance.

2) *Unlogged batch* - the only specific case for such a batch is when all updates are addressed to only *one* partition key and the batch has a *reasonable size*. In such a situation there *could be* performance benefits if you are using the Cassandra *TokenAware* load balancing policy. In this particular case all the updates go directly, without any additional coordination, to the primary node responsible for storing data for that partition key.

The *general rule* is that *individual updates in async mode* provide the best performance (https://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html). That's because they spread all updates across the whole cluster. When you use batches instead, you put a huge amount of pressure on a single coordinator node, because the coordinator needs to forward each individual insert/update/delete to the correct replicas. In general, you lose all the benefit of the Cassandra TokenAware load balancing policy when you update different partitions in a single round trip to the database.

Probably the only enhancement which could be done is to split our batch into smaller batches, each of which updates records having the same partition key (see the sketch below). In that case it could provide some performance benefit when used in combination with the Cassandra TokenAware policy. But there are several concerns:

1) It looks like a rather rare case.
2) It makes error handling more complex - you don't know which operations in a batch succeeded and which failed, so you have to retry the whole batch.
3) Retry logic could produce more load on the cluster - with individual updates you only retry the mutations that failed; with batches you retry the whole batch.
4) *Unlogged batch is deprecated in Cassandra 3.0* (https://docs.datastax.com/en/cql/3.3/cql/cql_reference/batch_r.html), which is the version we are currently using for the Ignite Cassandra module.

Igor Rudyak
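A hypothetical sketch of that enhancement - grouping mutations by partition key and sending one small unlogged batch per group, so that a TokenAware policy can route each batch straight to a replica. It assumes the DataStax Java driver 3.x API; partitionKeyOf(...) is an assumed helper, not an existing driver call:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Session;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerPartitionBatchSketch {
    /** Groups mutations by partition key and sends one UNLOGGED batch per group. */
    public static void writeGrouped(Session session, List<BoundStatement> mutations) {
        Map<Object, List<BoundStatement>> byPartition = new HashMap<>();

        for (BoundStatement st : mutations)
            byPartition.computeIfAbsent(partitionKeyOf(st), k -> new ArrayList<>()).add(st);

        for (List<BoundStatement> group : byPartition.values()) {
            // All statements in one batch hit the same partition, so the
            // coordinator chosen by a TokenAware policy is also a replica.
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);

            for (BoundStatement st : group)
                batch.add(st);

            session.executeAsync(batch);
        }
    }

    /** Assumed helper: extracts the partition key from a bound statement. */
    private static Object partitionKeyOf(BoundStatement st) {
        return st.getObject(0); // illustrative: assumes the key is the first bound value
    }
}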
I am still very confused. Igor, can you please explain what happens in Cassandra if a user calls the IgniteCache.putAll(...) method?

In Ignite, if putAll(...) is called, Ignite will make the best effort to execute the update as a batch, in which case the performance is better. What is the analogy in Cassandra?

D.
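For context, a minimal sketch of the call path under discussion, assuming the JSR-107 CacheWriter contract that Ignite cache stores implement: putAll(...) on a write-through cache ends up in the store's writeAll(...), which in the Cassandra module issues one async statement per entry rather than a BATCH. This is illustrative, not the real CassandraCacheStore:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import javax.cache.Cache;
import javax.cache.integration.CacheWriter;
import javax.cache.integration.CacheWriterException;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

public class SketchCacheStore implements CacheWriter<Long, String> {
    private final Session session;
    private final PreparedStatement insert;

    public SketchCacheStore(Session session) {
        this.session = session;
        this.insert = session.prepare("INSERT INTO ks.cache_table (key, val) VALUES (?, ?)");
    }

    @Override public void write(Cache.Entry<? extends Long, ? extends String> entry) {
        session.execute(insert.bind(entry.getKey(), entry.getValue()));
    }

    /** Invoked by Ignite for putAll(...): one async statement per entry, no BATCH. */
    @Override public void writeAll(Collection<Cache.Entry<? extends Long, ? extends String>> entries)
        throws CacheWriterException {
        List<ResultSetFuture> futures = new ArrayList<>(entries.size());

        for (Cache.Entry<? extends Long, ? extends String> e : entries)
            futures.add(session.executeAsync(insert.bind(e.getKey(), e.getValue())));

        for (ResultSetFuture f : futures)
            f.getUninterruptibly();
    }

    @Override public void delete(Object key) {
        // Omitted in this sketch.
    }

    @Override public void deleteAll(Collection<?> keys) {
        // Omitted in this sketch.
    }
}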
Hi Igor,
Does it make sense to you to use logged batches to guarantee atomicity in Cassandra in cases where we are doing a cross-cache transactional operation?

Luiz

--
Luiz Felipe Trevisan
Hi Luiz,
Logged batches are not the solution for achieving an atomic view of your Ignite transaction changes in Cassandra.

The problem with logged (aka atomic) batches is that they only guarantee that if any part of the batch succeeds, all of it eventually will; no other transactional enforcement is done at the batch level. In particular, there is no batch isolation: clients are able to read the first updated rows of a batch while the other rows are still being updated on the server (in RDBMS terminology this is the *READ-UNCOMMITTED* isolation level). So Cassandra means "atomic" only in the sense that if any part of the batch succeeds, all of it will.

Probably the best way to achieve read-atomic isolation for an Ignite transaction persisting data into Cassandra is to implement RAMP transactions (http://www.bailis.org/papers/ramp-sigmod2014.pdf) on top of Cassandra. I may create a ticket for this if the community would like it.

Igor Rudyak
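A small sketch of the behavior in question, assuming the DataStax Java driver 3.x API and two illustrative denormalized tables: a LOGGED batch is atomic (all statements eventually apply) but not isolated (a concurrent reader may see one table updated before the other):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class LoggedBatchSketch {
    /** Keeps two denormalized tables in sync atomically, but without isolation. */
    public static void updateBoth(Session session, long userId, String email) {
        PreparedStatement byId =
            session.prepare("UPDATE ks.users_by_id SET email = ? WHERE user_id = ?");
        PreparedStatement byEmail =
            session.prepare("INSERT INTO ks.users_by_email (email, user_id) VALUES (?, ?)");

        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.add(byId.bind(email, userId));
        batch.add(byEmail.bind(email, userId));

        // Atomic: if any statement applies, all of them eventually will.
        // NOT isolated: a reader may observe users_by_id already updated
        // while users_by_email still holds the old data (READ-UNCOMMITTED).
        session.execute(batch);
    }
}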
I totally agree with you regarding the guarantees we have with logged batches, and I'm also pretty much aware of the performance penalty involved in using this solution.

But since all read operations are executed via Ignite, isolation at the Cassandra level is not really important. I think the only guarantee really needed is that we don't end up with a partial insert in Cassandra in case we have a failure in Ignite and we lose the node that was responsible for this write operation.

My other assumption is that the write operation needs to finish before an eviction happens for this entry and we lose the data in the cache (since the batch doesn't guarantee isolation). However, if we cannot achieve this, I don't see the point of using Ignite as a cache store.

Luiz

--
Luiz Felipe Trevisan
There are actually some cases when atomic read isolation in Cassandra could be important. Let's assume a batch was persisted in Cassandra but not finalized yet - a read operation from Cassandra then returns only the partially committed data of the batch. This is a problem when:

1) Some of the batch records have already expired from the Ignite cache and we read them from the persistent store (Cassandra in our case).

2) All Ignite nodes storing the batch records (or a subset of them) died (or, for example, became unavailable for 10 seconds because of a network problem). While reading such records from the Ignite cache, we will be redirected to the persistent store.

3) Network separation occurred in such a way that we now have two Ignite clusters, but all the replicas of the batch data are located in only one of them. Again, while reading such records from the Ignite cache on the second cluster, we will be redirected to the persistent store.

In all of the mentioned cases, if the Cassandra batch isn't finalized yet, we will read partially committed transaction data.
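To make the read-through fallback concrete, here is an illustrative JSR-107 loader (not the real CassandraCacheStore): whenever an entry is missing from the cache - expired, or its nodes are gone - Ignite calls load(...), and whatever is visible in Cassandra at that moment is returned, including data from a half-applied batch:

import java.util.HashMap;
import java.util.Map;

import javax.cache.integration.CacheLoader;
import javax.cache.integration.CacheLoaderException;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SketchCacheLoader implements CacheLoader<Long, String> {
    private final Session session;
    private final PreparedStatement select;

    public SketchCacheLoader(Session session) {
        this.session = session;
        this.select = session.prepare("SELECT val FROM ks.cache_table WHERE key = ?");
    }

    /** Read-through path: returns whatever Cassandra holds right now. */
    @Override public String load(Long key) throws CacheLoaderException {
        Row row = session.execute(select.bind(key)).one();
        return row == null ? null : row.getString("val");
    }

    @Override public Map<Long, String> loadAll(Iterable<? extends Long> keys) {
        Map<Long, String> res = new HashMap<>();

        for (Long key : keys)
            res.put(key, load(key)); // sketch: one query per key

        return res;
    }
}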
Hi Igor,
I'm not a big Cassandra expert, but here are my thoughts.

1. Sending updates in a batch is always better than sending them one by one. For example, if you do putAll in Ignite with 100 entries and these entries are split across 5 nodes, the client will send 5 requests instead of 100. This provides a significant performance improvement. Is there a way to use a similar approach in Cassandra?

2. As for logged batches, I can easily believe that this is a rarely used feature, but since it exists in Cassandra, I can't find a single reason not to support it in our store as an option. Users that come across those rare cases will only say thank you to us :)

What do you think?

-Val
Thus Cassandra batches only mean "atomic" in the database sense that if any
>> part of the batch succeeds, all of it will.
>>
>> Probably the best way to achieve read atomic isolation for Ignite
>> transactions persisting data into Cassandra is to implement RAMP
>> transactions (http://www.bailis.org/papers/ramp-sigmod2014.pdf) on top
>> of Cassandra.
>>
>> I may create a ticket for this if the community would like it.
>>
>> Igor Rudyak
>>
>> On Wed, Jul 27, 2016 at 12:55 PM, Luiz Felipe Trevisan <[hidden email]> wrote:
>>
>>> Hi Igor,
>>>
>>> Does it make sense for you to use logged batches to guarantee atomicity
>>> in Cassandra in cases where we are doing a cross-cache transaction
>>> operation?
>>>
>>> Luiz
>>>
>>> --
>>> Luiz Felipe Trevisan
>>>
>>> On Wed, Jul 27, 2016 at 2:05 AM, Dmitriy Setrakyan <[hidden email]> wrote:
>>>
>>>> I am still very confused. Ilya, can you please explain what happens in
>>>> Cassandra if a user calls the IgniteCache.putAll(...) method?
>>>>
>>>> In Ignite, if putAll(...) is called, Ignite will make the best effort
>>>> to execute the update as a batch, in which case the performance is
>>>> better. What is the analogy in Cassandra?
>>>>
>>>> D.
>>>>
>>>> On Tue, Jul 26, 2016 at 9:16 PM, Igor Rudyak <[hidden email]> wrote:
>>>>
>>>>> Dmitriy,
>>>>>
>>>>> Exactly the same approach is used for all async read/write/delete
>>>>> operations - the Cassandra session just provides an
>>>>> executeAsync(statement) function for all types of operations.
>>>>>
>>>>> To be more detailed about Cassandra batches, there are actually two
>>>>> types of batches:
>>>>>
>>>>> 1) *Logged batch* (aka atomic) - the main purpose of such batches is
>>>>> to keep duplicated data in sync while updating multiple tables, but at
>>>>> the cost of performance.
>>>>>
>>>>> 2) *Unlogged batch* - the only specific case for such a batch is when
>>>>> all updates are addressed to only *one* partition key and the batch
>>>>> has a "*reasonable size*". In such a situation there *could be*
>>>>> performance benefits if you are using the Cassandra *TokenAware* load
>>>>> balancing policy. In this particular case all the updates go directly,
>>>>> without any additional coordination, to the primary node responsible
>>>>> for storing data for this partition key.
>>>>>
>>>>> The *generic rule* is that *individual updates in async mode* provide
>>>>> the best performance
>>>>> (https://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html).
>>>>> That's because they spread all updates across the whole cluster. In
>>>>> contrast, when you use batches, you are actually putting a huge amount
>>>>> of pressure on a single coordinator node, because the coordinator
>>>>> needs to forward each individual insert/update/delete to the correct
>>>>> replicas. In general you lose all the benefit of the Cassandra
>>>>> TokenAware load balancing policy when you update different partitions
>>>>> in a single round trip to the database.
>>>>>
>>>>> Probably the only enhancement that could be made is to split our batch
>>>>> into smaller batches, each of which updates records having the same
>>>>> partition key. In this case it could provide some performance benefit
>>>>> when used in combination with the Cassandra TokenAware policy. But
>>>>> there are several concerns:
>>>>>
>>>>> 1) It looks like a rather rare case
>>>>> 2) It makes error handling more complex - you don't know which
>>>>> operations in a batch succeeded and which failed, so you have to retry
>>>>> the whole batch
>>>>> 3) Retry logic could produce more load on the cluster - with
>>>>> individual updates you only need to retry the mutations that failed,
>>>>> while with batches you need to retry the whole batch
>>>>> 4) *Unlogged batch is deprecated in Cassandra 3.0*
>>>>> (https://docs.datastax.com/en/cql/3.3/cql/cql_reference/batch_r.html),
>>>>> which is the version we are currently using for the Ignite Cassandra
>>>>> module.
>>>>>
>>>>> Igor Rudyak
|
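To make the logged-batch semantics discussed above concrete, here is a minimal sketch using the DataStax Java driver 3.x (the keyspace, tables, and class name are made up for illustration; this is not the Ignite module's actual code):

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class LoggedBatchSketch {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("test_ks")) {
                PreparedStatement byId = session.prepare(
                    "INSERT INTO users_by_id (id, email) VALUES (?, ?)");
                PreparedStatement byEmail = session.prepare(
                    "INSERT INTO users_by_email (email, id) VALUES (?, ?)");

                // Logged batch: if any part of the batch is applied, all of it
                // eventually will be. A concurrent reader, however, may still
                // observe one table updated before the other (no isolation).
                BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
                batch.add(byId.bind(1, "user@example.com"));
                batch.add(byEmail.bind("user@example.com", 1));
                session.execute(batch);
            }
        }
    }

This is the keep-denormalized-tables-in-sync use case: the batch buys eventual atomicity across the two tables, not isolation and not speed.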
Hi Valentin,
1) Regarding unlogged batches, I don't think it makes sense to support them, because:
- They are deprecated starting from Cassandra 3.0 (which we are currently
using in the Cassandra module)
- According to the Cassandra documentation
(http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html), "Batches
are often mistakenly used in an attempt to optimize performance". The
Cassandra experts say that using no batches
(https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e#.rxkmfe209)
is the fastest way to load data. I checked this with batches whose records
have different partition keys, and it's definitely true. For a small batch
of records that all have the same partition key (affinity in Ignite),
batches could provide better performance, but I didn't investigate this
case deeply (what the optimal batch size is, how significant the
performance benefit is, etc.). I can try to run some load tests to get a
better understanding of this.

2) Regarding logged batches, I think it makes sense to support them in the
Cassandra module for transactional caches. The bad thing is that they don't
provide isolation; the good thing is that they guarantee all your changes
will eventually be committed and become visible to clients. Thus it's still
better than nothing... However, there is a better approach: we can
implement a transactional protocol on top of Cassandra, which will give us
atomic read isolation - you'll either see all the changes made by a
transaction or none of them. For example, we can implement RAMP
transactions (http://www.bailis.org/papers/ramp-sigmod2014.pdf), because
they introduce rather low overhead.

Igor Rudyak

On Thu, Jul 28, 2016 at 11:00 PM, Valentin Kulichenko <[hidden email]> wrote:

> Hi Igor,
>
> I'm not a big Cassandra expert, but here are my thoughts.
>
> 1. Sending updates in a batch is always better than sending them one by
> one. For example, if you do putAll in Ignite with 100 entries, and these
> entries are split across 5 nodes, the client will send 5 requests instead
> of 100. This provides a significant performance improvement. Is there a
> way to use a similar approach in Cassandra?
> 2. As for logged batches, I can easily believe that this is a rarely used
> feature, but since it exists in Cassandra, I can't find a single reason
> why not to support it in our store as an option. Users that come across
> those rare cases will only say thank you to us :)
>
> What do you think?
>
> -Val
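For comparison, the "no batches" loading style referenced above boils down to firing individual asynchronous statements and collecting the futures, roughly as in this sketch (DataStax Java driver 3.x; the prepared statement and integer keys are illustrative assumptions, not the actual CassandraSessionImpl code):

    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class AsyncWriteAllSketch {
        // One INSERT per entry, no batching: each mutation can be routed
        // token-aware straight to an owning replica, which spreads the load
        // across the whole cluster instead of a single coordinator.
        static void writeAll(Session session, PreparedStatement insert,
                             Map<Integer, String> entries) {
            List<ResultSetFuture> futures = new ArrayList<>(entries.size());
            for (Map.Entry<Integer, String> e : entries.entrySet())
                futures.add(session.executeAsync(insert.bind(e.getKey(), e.getValue())));

            // Wait for completion; any failed write surfaces here and could
            // be retried individually rather than as a whole batch.
            for (ResultSetFuture f : futures)
                f.getUninterruptibly();
        }
    }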
|
Hi Igor,
1) Yes, I'm talking about splitting the entry set into per-partition (or
per-node) batches. Having entries that are stored on different nodes in the
same batch doesn't make much sense, of course.

2) RAMP looks interesting, but it seems to be a pretty complicated task.
How about adding support for Cassandra's built-in logged batches first
(this should be fairly easy to implement) and then improving the atomicity
as a second phase?

-Val
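As a sketch of the per-node splitting described in point 1, the driver's token metadata can be used to group keys by an owning host before flushing each group separately (DataStax Java driver 3.x; integer partition keys and the helper name are illustrative assumptions, not a proposed implementation):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Host;
    import com.datastax.driver.core.ProtocolVersion;
    import com.datastax.driver.core.TypeCodec;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ReplicaGroupingSketch {
        // Groups integer partition keys by one of the hosts that own them,
        // so each group could later be sent as a single per-node request.
        static Map<Host, List<Integer>> groupByReplica(Cluster cluster, String keyspace,
                                                       Collection<Integer> keys) {
            Map<Host, List<Integer>> groups = new HashMap<>();
            for (Integer key : keys) {
                ByteBuffer pk = TypeCodec.cint().serialize(key, ProtocolVersion.V4);
                Set<Host> replicas = cluster.getMetadata().getReplicas(keyspace, pk);
                if (replicas.isEmpty())
                    continue; // token metadata not available yet
                Host owner = replicas.iterator().next();
                groups.computeIfAbsent(owner, h -> new ArrayList<>()).add(key);
            }
            return groups;
        }
    }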
In the such situation we have problems when: >>> >>> 1) Some of the batch records already expired from Ignite cache and we >>> reading them from persistent store (Cassandra in our case). >>> >>> 2) All Ignite nodes storing the batch records (or subset records) died >>> (or >>> for example became unavailable for 10sec because of network problem). >>> While >>> reading such records from Ignite cache we will be redirected to >>> persistent >>> store. >>> >>> 3) Network separation occurred such a way that we now have two Ignite >>> cluster, but all the replicas of the batch data are located only in one >>> of >>> these clusters. Again while reading such records from Ignite cache on the >>> second cluster we will be redirected to persistent store. >>> >>> In all mentioned cases, if Cassandra batch isn't finalized yet - we will >>> read partially committed transaction data. >>> >>> >>> On Thu, Jul 28, 2016 at 6:52 AM, Luiz Felipe Trevisan < >>> [hidden email]> wrote: >>> >>> > I totally agree with you regarding the guarantees we have with logged >>> > batches and I'm also pretty much aware of the performance penalty >>> involved >>> > using this solution. >>> > >>> > But since all read operations are executed via ignite it means that >>> > isolation in the Cassandra level is not really important. I think the >>> only >>> > guarantee really needed is that we don't end up with a partial insert >>> in >>> > Cassandra in case we have a failure in ignite and we loose the node >>> that >>> > was responsible for this write operation. >>> > >>> > My other assumption is that the write operation needs to finish before >>> an >>> > eviction happens for this entry and we loose the data in cache (since >>> batch >>> > doesn't guarantee isolation). However if we cannot achieve this I >>> don't see >>> > why use ignite as a cache store. >>> > >>> > Luiz >>> > >>> > -- >>> > Luiz Felipe Trevisan >>> > >>> > On Wed, Jul 27, 2016 at 4:55 PM, Igor Rudyak <[hidden email]> >>> wrote: >>> > >>> >> Hi Luiz, >>> >> >>> >> Logged batches is not the solution to achieve atomic view of your >>> Ignite >>> >> transaction changes in Cassandra. >>> >> >>> >> The problem with logged batches(aka atomic) is they guarantees that if >>> >> any part of the batch succeeds, all of it will, no other transactional >>> >> enforcement is done at the batch level. For example, there is no batch >>> >> isolation. Clients are able to read the first updated rows from the >>> batch, >>> >> while other rows are still being updated on the server (in RDBMS >>> >> terminology it means *READ-UNCOMMITED* isolation level). Thus >>> Cassandra >>> >>> >> mean "atomic" in the database sense that if any part of the batch >>> succeeds, >>> >> all of it will. >>> >> >>> >> Probably the best way to archive read atomic isolation for Ignite >>> >> transaction persisting data into Cassandra, is to implement RAMP >>> >> transactions (http://www.bailis.org/papers/ramp-sigmod2014.pdf) on >>> top >>> >> of Cassandra. >>> >> >>> >> I may create a ticket for this if community would like it. >>> >> >>> >> >>> >> Igor Rudyak >>> >> >>> >> >>> >> On Wed, Jul 27, 2016 at 12:55 PM, Luiz Felipe Trevisan < >>> >> [hidden email]> wrote: >>> >> >>> >>> Hi Igor, >>> >>> >>> >>> Does it make sense for you using logged batches to guarantee >>> atomicity >>> >>> in Cassandra in cases we are doing a cross cache transaction >>> operation? 
>>> >>> >>> >>> Luiz >>> >>> >>> >>> -- >>> >>> Luiz Felipe Trevisan >>> >>> >>> >>> On Wed, Jul 27, 2016 at 2:05 AM, Dmitriy Setrakyan < >>> >>> [hidden email]> wrote: >>> >>> >>> >>>> I am very confused still. Ilya, can you please explain what happens >>> in >>> >>>> Cassandra if user calls IgniteCache.putAll(...) method? >>> >>>> >>> >>>> In Ignite, if putAll(...) is called, Ignite will make the best >>> effort to >>> >>>> execute the update as a batch, in which case the performance is >>> better. >>> >>>> What is the analogy in Cassandra? >>> >>>> >>> >>>> D. >>> >>>> >>> >>>> On Tue, Jul 26, 2016 at 9:16 PM, Igor Rudyak <[hidden email]> >>> wrote: >>> >>>> >>> >>>> > Dmitriy, >>> >>>> > >>> >>>> > There is absolutely same approach for all async read/write/delete >>> >>>> > operations - Cassandra session just provides >>> executeAsync(statement) >>> >>>> > function >>> >>>> > for all type of operations. >>> >>>> > >>> >>>> > To be more detailed about Cassandra batches, there are actually >>> two >>> >>>> types >>> >>>> > of batches: >>> >>>> > >>> >>>> > 1) *Logged batch* (aka atomic) - the main purpose of such batches >>> is >>> >>>> to >>> >>>> > keep duplicated data in sync while updating multiple tables, but >>> at >>> >>>> the >>> >>>> > cost of performance. >>> >>>> > >>> >>>> > 2) *Unlogged batch* - the only specific case for such batch is >>> when >>> >>>> all >>> >>>> > updates are addressed to only *one* partition key and batch having >>> >>>> > "*reasonable >>> >>>> > size*". In a such situation there *could be* performance benefits >>> if >>> >>>> you >>> >>>> > are using Cassandra *TokenAware* load balancing policy. In this >>> >>>> particular >>> >>>> > case all the updates will go directly without any additional >>> >>>> > coordination to the primary node, which is responsible for storing >>> >>>> data for >>> >>>> > this partition key. >>> >>>> > >>> >>>> > The *generic rule* is that - *individual updates using async mode* >>> >>>> provides >>> >>>> > the best performance ( >>> >>>> > https://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html >>> ). >>> >>>> That's >>> >>>> > because it spread all updates across the whole cluster. In >>> contrast to >>> >>>> > this, when you are using batches, what this is actually doing is >>> >>>> putting a >>> >>>> > huge amount of pressure on a single coordinator node. This is >>> because >>> >>>> the >>> >>>> > coordinator needs to forward each individual insert/update/delete >>> to >>> >>>> the >>> >>>> > correct replicas. In general you're just losing all the benefit of >>> >>>> > Cassandra TokenAware load balancing policy when you're updating >>> >>>> different >>> >>>> > partitions in a single round trip to the database. >>> >>>> > >>> >>>> > Probably the only enhancement which could be done is to separate >>> our >>> >>>> batch >>> >>>> > to smaller batches, each of which is updating records having the >>> same >>> >>>> > partition key. In this case it could provide some performance >>> >>>> benefits when >>> >>>> > used in combination with Cassandra TokenAware policy. 
But there >>> are >>> >>>> several >>> >>>> > concerns: >>> >>>> > >>> >>>> > 1) It looks like rather rare case >>> >>>> > 2) Makes error handling more complex - you just don't know what >>> >>>> operations >>> >>>> > in a batch succeed and what failed and need to retry all batch >>> >>>> > 3) Retry logic could produce more load on the cluster - in case of >>> >>>> > individual updates you just need to retry the only mutations >>> which are >>> >>>> > failed, in case of batches you need to retry the whole batch >>> >>>> > 4)* Unlogged batch is deprecated in Cassandra 3.0* ( >>> >>>> > >>> https://docs.datastax.com/en/cql/3.3/cql/cql_reference/batch_r.html), >>> >>>> > which >>> >>>> > we are currently using for Ignite Cassandra module. >>> >>>> > >>> >>>> > >>> >>>> > Igor Rudyak >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> > On Tue, Jul 26, 2016 at 4:45 PM, Dmitriy Setrakyan < >>> >>>> [hidden email]> >>> >>>> > wrote: >>> >>>> > >>> >>>> > > >>> >>>> > > >>> >>>> > > On Tue, Jul 26, 2016 at 5:53 PM, Igor Rudyak <[hidden email] >>> > >>> >>>> wrote: >>> >>>> > > >>> >>>> > >> Hi Valentin, >>> >>>> > >> >>> >>>> > >> For writeAll/readAll Cassandra cache store implementation uses >>> >>>> async >>> >>>> > >> operations ( >>> >>>> http://www.datastax.com/dev/blog/java-driver-async-queries) >>> >>>> > >> and >>> >>>> > >> futures, which has the best characteristics in terms of >>> >>>> performance. >>> >>>> > >> >>> >>>> > >> >>> >>>> > > Thanks, Igor. This link describes the query operations, but I >>> could >>> >>>> not >>> >>>> > > find the mention of writes. >>> >>>> > > >>> >>>> > > >>> >>>> > >> Cassandra BATCH statement is actually quite often anti-pattern >>> for >>> >>>> those >>> >>>> > >> who come from relational world. BATCH statement concept in >>> >>>> Cassandra is >>> >>>> > >> totally different from relational world and is not for >>> optimizing >>> >>>> > >> batch/bulk operations. The main purpose of Cassandra BATCH is >>> to >>> >>>> keep >>> >>>> > >> denormalized data in sync. For example when you duplicating the >>> >>>> same >>> >>>> > data >>> >>>> > >> into several tables. All other cases are not recommended for >>> >>>> Cassandra >>> >>>> > >> batches: >>> >>>> > >> - >>> >>>> > >> >>> >>>> > >> >>> >>>> > >>> >>>> >>> https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e#.k4xfir8ij >>> >>>> > >> - >>> >>>> > >> >>> >>>> > >> >>> >>>> > >>> >>>> >>> http://christopher-batey.blogspot.com/2015/02/cassandra-anti-pattern-misuse-of.html >>> >>>> > >> - >>> >>>> >>> https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/ >>> >>>> > >> >>> >>>> > >> It's also good to mention that in CassandraCacheStore >>> >>>> implementation >>> >>>> > >> (actually in CassandraSessionImpl) all operation with >>> Cassandra is >>> >>>> > wrapped >>> >>>> > >> in a loop. The reason is in a case of failure it will be >>> performed >>> >>>> 20 >>> >>>> > >> attempts to retry the operation with incrementally increasing >>> >>>> timeouts >>> >>>> > >> starting from 100ms and specific exception handling logic >>> >>>> (Cassandra >>> >>>> > hosts >>> >>>> > >> unavailability and etc.). Thus it provides quite reliable >>> >>>> persistence >>> >>>> > >> mechanism. According to load tests, even on heavily overloaded >>> >>>> Cassandra >>> >>>> > >> cluster (CPU LOAD > 10 per one core) there were no lost >>> >>>> > >> writes/reads/deletes and maximum 6 attempts to perform one >>> >>>> operation. 
>>> >>>> > >> >>> >>>> > > >>> >>>> > > I think that the main point about Cassandra batch operations is >>> not >>> >>>> about >>> >>>> > > reliability, but about performance. If user batches up 100s of >>> >>>> updates >>> >>>> > in 1 >>> >>>> > > Cassandra batch, then it will be a lot faster than doing them >>> >>>> 1-by-1 in >>> >>>> > > Ignite. Wrapping them into Ignite "putAll(...)" call just seems >>> more >>> >>>> > > logical to me, no? >>> >>>> > > >>> >>>> > > >>> >>>> > >> >>> >>>> > >> Igor Rudyak >>> >>>> > >> >>> >>>> > >> On Tue, Jul 26, 2016 at 1:58 PM, Valentin Kulichenko < >>> >>>> > >> [hidden email]> wrote: >>> >>>> > >> >>> >>>> > >> > Hi Igor, >>> >>>> > >> > >>> >>>> > >> > I noticed that current Cassandra store implementation doesn't >>> >>>> support >>> >>>> > >> > batching for writeAll and deleteAll methods, it simply >>> executes >>> >>>> all >>> >>>> > >> updates >>> >>>> > >> > one by one (asynchronously in parallel). >>> >>>> > >> > >>> >>>> > >> > I think it can be useful to provide such support and created >>> a >>> >>>> ticket >>> >>>> > >> [1]. >>> >>>> > >> > Can you please give your input on this? Does it make sense in >>> >>>> your >>> >>>> > >> opinion? >>> >>>> > >> > >>> >>>> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-3588 >>> >>>> > >> > >>> >>>> > >> > -Val >>> >>>> > >> > >>> >>>> > >> >>> >>>> > > >>> >>>> > > >>> >>>> > >>> >>>> >>> >>> >>> >>> >>> >> >>> > >>> >> >> > |
Hi Valentin,
Sounds reasonable. I'll create a ticket to add Cassandra logged batches and
will try to prepare some load tests to investigate whether unlogged batches
can provide better performance. I'll also add a ticket for RAMP as a
long-term enhancement.

Igor Rudyak
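A crude harness for those load tests could time the two styles side by side (DataStax Java driver 3.x; the statement, data shape, and sizes are placeholders, and real tests would need warm-up, multiple runs, and a realistic cluster):

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchVsAsyncLoadTest {
        // Individual async writes: the baseline recommended by the driver docs.
        static long runAsync(Session s, PreparedStatement ps, int n) {
            long start = System.nanoTime();
            List<ResultSetFuture> futures = new ArrayList<>(n);
            for (int i = 0; i < n; i++)
                futures.add(s.executeAsync(ps.bind(i, "v" + i)));
            for (ResultSetFuture f : futures)
                f.getUninterruptibly();
            return (System.nanoTime() - start) / 1_000_000; // elapsed ms
        }

        // Unlogged batches of a fixed size, flushed synchronously.
        static long runBatched(Session s, PreparedStatement ps, int n, int batchSize) {
            long start = System.nanoTime();
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (int i = 0; i < n; i++) {
                batch.add(ps.bind(i, "v" + i));
                if (batch.size() == batchSize) {
                    s.execute(batch);
                    batch.clear();
                }
            }
            if (batch.size() > 0)
                s.execute(batch);
            return (System.nanoTime() - start) / 1_000_000; // elapsed ms
        }
    }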
|
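And for the long-term RAMP ticket mentioned above, the read-atomicity idea from the paper can be illustrated with a toy in-memory model of the RAMP-Fast read path (an algorithm sketch only, with invented class names; not Cassandra code and not a proposed implementation):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class RampFastSketch {
        // Every write carries the transaction timestamp and the set of sibling
        // keys written by the same transaction; readers use that metadata to
        // detect fractured reads and fetch the missing versions.
        static class Version {
            final long ts; final String value; final Set<String> siblings;
            Version(long ts, String value, Set<String> siblings) {
                this.ts = ts; this.value = value; this.siblings = siblings;
            }
        }

        // key -> (timestamp -> version); a real store would prune old versions.
        final Map<String, NavigableMap<Long, Version>> store = new ConcurrentHashMap<>();

        void writeAll(long ts, Map<String, String> items) {
            for (Map.Entry<String, String> e : items.entrySet())
                store.computeIfAbsent(e.getKey(), k -> new ConcurrentSkipListMap<>())
                     .put(ts, new Version(ts, e.getValue(), items.keySet()));
        }

        Map<String, String> readAll(Set<String> keys) {
            // Round 1: read the latest version of each requested key.
            Map<String, Version> latest = new HashMap<>();
            for (String k : keys) {
                NavigableMap<Long, Version> versions = store.get(k);
                if (versions != null)
                    latest.put(k, versions.lastEntry().getValue());
            }
            // From the metadata, compute the highest timestamp each key must
            // reach to be consistent with its siblings.
            Map<String, Long> required = new HashMap<>();
            for (Version v : latest.values())
                for (String sibling : v.siblings)
                    if (latest.containsKey(sibling))
                        required.merge(sibling, v.ts, Math::max);
            // Round 2: where round 1 returned a stale version, fetch the exact
            // version the sibling metadata demands.
            Map<String, String> result = new HashMap<>();
            for (Map.Entry<String, Version> e : latest.entrySet()) {
                long need = required.getOrDefault(e.getKey(), e.getValue().ts);
                Version v = e.getValue().ts >= need
                    ? e.getValue()
                    : store.get(e.getKey()).get(need);
                result.put(e.getKey(), v == null ? e.getValue().value : v.value);
            }
            return result;
        }
    }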