Hello Igniters,
One of the major improvements to DML is support for batch statements, and I'd like to discuss its implementation. The suggested approach is to rewrite the given query, turning several INSERTs into a single statement and processing the arguments accordingly.

I suggest this because the whole point of batching is to make as few interactions with the cluster as possible and to keep operations as condensed as possible; in the case of Ignite this means we should send as few JdbcQueryTasks as possible. And since a query task holds a single query and its arguments, this approach requires no changes to the current design and won't break any backward compatibility: all the dirty work of rewriting is done by the JDBC driver.

Without rewriting, we could introduce some new query task for batch operations, but that would make it impossible to send such requests from newer clients to older servers (say, servers of version 1.8.0, which know nothing about batching, let alone older versions).

I'd like to hear comments and suggestions from the community. Thanks!

- Alex
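For illustration, a minimal sketch of the rewriting idea (the helper names and the multi-row VALUES strategy are my assumptions, not an actual driver implementation): N buffered parameter sets for a single-row INSERT become one multi-row INSERT with flattened arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any driver: sketches the rewrite
// strategy of collapsing N INSERT executions into one statement.
class BatchRewrite {
    /**
     * Rewrites "INSERT INTO t VALUES (?, ?)" executed with N parameter
     * sets into "INSERT INTO t VALUES (?, ?), (?, ?), ..." so that a
     * single query task carries the whole batch.
     */
    static String rewriteInsert(String sql, int batchSize) {
        int valuesIdx = sql.toUpperCase().indexOf("VALUES");
        if (valuesIdx < 0 || batchSize < 1)
            throw new IllegalArgumentException("Not a rewritable INSERT: " + sql);

        String head = sql.substring(0, valuesIdx + "VALUES".length());
        String row = sql.substring(valuesIdx + "VALUES".length()).trim(); // e.g. "(?, ?)"

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < batchSize; i++)
            sb.append(i == 0 ? " " : ", ").append(row);
        return sb.toString();
    }

    /** Flattens per-row argument arrays into the single argument list of the rewritten query. */
    static List<Object> flattenArgs(List<Object[]> rows) {
        List<Object> flat = new ArrayList<>();
        for (Object[] row : rows)
            for (Object arg : row)
                flat.add(arg);
        return flat;
    }
}
```

The rewritten statement plus flattened arguments then fit the existing one-query-per-task design unchanged.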
Hi Alex,
To my understanding there are two possible approaches to batching in the JDBC layer:

1) Rely on the default batching API, specifically *PreparedStatement.addBatch()* [1] and friends. This is a nice, clear API, users are used to it, and its adoption will minimize user code changes when migrating from other JDBC sources. We simply copy updates locally and then execute them all at once with only a single network hop to the servers. *IgniteDataStreamer* can be used underneath.

2) Or we can have a separate connection flag which routes all INSERT/UPDATE/DELETE statements through the streamer.

I prefer the first approach.

Also, we need to keep in mind that the data streamer has poor performance when adding single key-value pairs, due to high overhead on concurrency and other bookkeeping. Instead, it is better to pre-batch key-value pairs before giving them to the streamer.

Vladimir.

[1] https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
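The standard API Vladimir refers to looks roughly like this in user code (a generic JDBC sketch; the table, column names, and row count are placeholders, and nothing here is Ignite-specific):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class BatchInsertExample {
    /**
     * Standard JDBC batching: buffer parameter sets with addBatch(),
     * then send them all at once with executeBatch().
     */
    static int loadPersons(Connection conn, int rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO person (id, name) VALUES (?, ?)")) {
            for (int i = 0; i < rows; i++) {
                ps.setInt(1, i);
                ps.setString(2, "name-" + i);
                ps.addBatch();                      // buffered locally, nothing sent yet
            }
            return totalUpdates(ps.executeBatch()); // one shot to the server(s)
        }
    }

    /** Sums the per-statement update counts returned by executeBatch(). */
    static int totalUpdates(int[] counts) {
        int total = 0;
        for (int c : counts)
            total += c;
        return total;
    }
}
```

Because nothing leaves the client before `executeBatch()`, a driver is free to ship the whole batch in a single network interaction.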
Vlad,
1. Of course, the API in my view should be the well-known JDBC one: addBatch and friends. The question was not about the API but rather about the implementation. What do you mean by "copy locally and execute all at once"?

2. As I see it, this does not contradict the first approach and could be implemented alongside it as well.

Thanks,
Alex
Guys,
I discussed this feature with Dmitriy, and we came to the conclusion that batching in JDBC and data streaming in Ignite have different semantics and performance characteristics. Thus they are independent features (they may work together or separately, but that is another story). Let me explain.

This is how JDBC batching works:
- Add N sets of parameters to a prepared statement.
- Manually execute the prepared statement.
- Repeat until all the data is loaded.

This is how the data streamer works:
- Keep adding data.
- The streamer buffers the data and loads the buffered per-node batches when they are big enough.
- Close the streamer to make sure that everything is loaded.

As you can see, there is a difference in the semantics of when we send data: if our JDBC driver allows sending batches to nodes without calling `execute` (and we would probably have to make `execute` a no-op here), then we violate JDBC semantics; if we disallow this behavior, then the batching will underperform.

Thus I suggest keeping these features (JDBC batching and JDBC streaming) separate. As I already said, they can work together: batching will collect parameters, on `execute` they will go to the streamer in one shot, and the streamer will deal with the rest.

Sergi
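Sergi's "work together" scheme (buffer parameter sets locally, hand them to the streamer in one shot on execute) can be sketched with a pluggable sink standing in for the streamer; the class and its names are illustrative, not actual Ignite code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only: the sink stands in for IgniteDataStreamer,
// which would receive the whole batch at once on executeBatch().
class JdbcBatch {
    private final List<Object[]> buf = new ArrayList<>();
    private final Consumer<List<Object[]>> sink;

    JdbcBatch(Consumer<List<Object[]>> sink) {
        this.sink = sink;
    }

    /** Corresponds to PreparedStatement.addBatch(): buffer locally, send nothing. */
    void addBatch(Object... params) {
        buf.add(params);
    }

    /** Corresponds to executeBatch(): the buffered parameter sets go out in one shot. */
    int executeBatch() {
        int size = buf.size();
        sink.accept(new ArrayList<>(buf)); // hand everything to the streamer at once
        buf.clear();
        return size;
    }
}
```

With a real streamer the sink would call `addData(...)` per row and then flush; since nothing leaves the client before `executeBatch`, JDBC semantics are preserved.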
Sergi,
JDBC batching might work quite differently from driver to driver. Say, MySQL happily rewrites queries as I suggested at the beginning of this thread (it's not the only strategy, but one of the possible options); and, BTW, I would like to hear at least an opinion about it.

On your first approach, the part before the streamer: you suggest that we send a single statement and multiple parameter sets as a single query task, am I right? (Just to make sure I got you properly.) If so, do you also mean that the API between server and client (namely JdbcQueryTask) should also change? Or should new API means be added to facilitate batching tasks?

- Alex
If we are bothered about performance and violating JDBC rules, then we can easily do the following:

1) Add a boolean flag "*batch_streaming*" to the JDBC connection string.
2) If it is "*false*" (the default), we copy all updates locally and flush them only on the "*executeBatch*" call. This way JDBC semantics are preserved.
3) If it is "*true*", all adds to the batch go to the streamer directly. This way it might be faster, but it violates JDBC; e.g. a call to "*clearBatch*" no longer works, and we should throw an exception.

The bottom line is that normal non-batched operations should never go through the streamer. The streamer is only involved when: a) the user explicitly declared that he performs a batch update, and b) the special flag in the connection string is set.

Vladimir.
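A minimal sketch of how the proposed flag could be read from the connection string (the "batch_streaming" parameter name comes from the proposal above; the URL layout and parser are my assumptions for illustration):

```java
// Hypothetical flag parser, not actual driver code.
class ConnectionFlags {
    /** Returns the value of batch_streaming from the URL's query part; false by default. */
    static boolean batchStreaming(String url) {
        int q = url.indexOf('?');
        if (q < 0)
            return false; // default: buffer locally, preserve JDBC semantics

        for (String pair : url.substring(q + 1).split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("batch_streaming"))
                return Boolean.parseBoolean(kv[1]);
        }
        return false;
    }
}
```

The driver would then branch in addBatch(): buffer locally when the flag is false, or push straight to the streamer (and throw from clearBatch()) when it is true.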
Alex,
In most cases JdbcQueryTask should be executed locally, on the client node started by the JDBC driver:

    JdbcQueryTask.QueryResult res = loc ? qryTask.call() :
        ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);

Is this behavior still valid after introducing the DML functionality?

In cases when the user wants to execute a query on a specific node, he should fully understand what he wants and what can go wrong.
Vladimir,
I see no reason to forbid streamer usage for non-batched statement execution. It is common that users already have their ETL tools, and you can't be sure whether they use batching or not.

Alex,

I guess we have to decide on streaming first, and then we will discuss batching separately, OK? This decision may become important for the batching implementation.

Sergi
Sergi,
If user call single *execute() *operation, than most likely it is not batching. We should not rely on strange case where user perform batching without using standard and well-adopted batching JDBC API. The main problem with streamer is that it is async and hence break happens-before guarantees in a single thread: SELECT after INSERT might not return inserted value. Honestly, I do not really understand why we are trying to re-invent a bicycle here. There is standard API - let's just use it and make flexible enough to take advantage of IgniteDataStreamer if needed. Is there any use case which is not covered with this solution? Or let me ask from the opposite side - are there any well-known JDBC drivers which perform batching/streaming from non-batched update statements? Vladimir. On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <[hidden email]> wrote: > Vladimir, > > I see no reason to forbid Streamer usage from non-batched statement > execution. > It is common that users already have their ETL tools and you can't be sure > if they use batching or not. > > Alex, > > I guess we have to decide on Streaming first and then we will discuss > Batching separately, ok? Because this decision may become important for > batching implementation. > > Sergi > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <[hidden email]>: > > > Alex, > > > > In most cases JdbcQueryTask should be executed locally on client node > > started by JDBC driver. > > > > JdbcQueryTask.QueryResult res = > > loc ? qryTask.call() : > > ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask); > > > > Is it valid behavior after introducing DML functionality? > > > > In cases when user wants to execute query on specific node he should > > fully understand what he wants and what can go in wrong way. > > > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko > > <[hidden email]> wrote: > > > Sergi, > > > > > > JDBC batching might work quite differently from driver to driver. 
Say, > > > MySQL happily rewrites queries as I had suggested in the beginning of > > > this thread (it's not the only strategy, but one of the possible > > > options) - and, BTW, would like to hear at least an opinion about it. > > > > > > On your first approach, section before streamer: you suggest that we > > > send single statement and multiple param sets as a single query task, > > > am I right? (Just to make sure that I got you properly.) If so, do you > > > also mean that API (namely JdbcQueryTask) between server and client > > > should also change? Or should new API means be added to facilitate > > > batching tasks? > > > > > > - Alex > > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <[hidden email]>: > > >> Guys, > > >> > > >> I discussed this feature with Dmitriy and we came to conclusion that > > >> batching in JDBC and Data Streaming in Ignite have different semantics > > and > > >> performance characteristics. Thus they are independent features (they > > may > > >> work together, may separately, but this is another story). > > >> > > >> Let me explain. > > >> > > >> This is how JDBC batching works: > > >> - Add N sets of parameters to a prepared statement. > > >> - Manually execute prepared statement. > > >> - Repeat until all the data is loaded. > > >> > > >> > > >> This is how data streamer works: > > >> - Keep adding data. > > >> - Streamer will buffer and load buffered per-node batches when they > are > > big > > >> enough. > > >> - Close streamer to make sure that everything is loaded. > > >> > > >> As you can see we have a difference in semantics of when we send data: > > if > > >> in our JDBC we will allow sending batches to nodes without calling > > >> `execute` (and probably we will need to make `execute` to no-op here), > > then > > >> we are violating semantics of JDBC, if we will disallow this behavior, > > then > > >> this batching will underperform. 
> > >> > > >> Thus I suggest keeping these features (JDBC Batching and JDBC Streaming) as > > >> separate features. > > >> > > >> As I already said they can work together: Batching will batch parameters > > >> and on `execute` they will go to the Streamer in one shot and Streamer will > > >> deal with the rest. > > >> > > >> Sergi > > >> > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <[hidden email]>: |
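For context, the rewriting strategy Alex proposes (and which MySQL's Connector/J applies under its `rewriteBatchedStatements` flag) can be sketched as pure string manipulation on the driver side; the class and method names below are illustrative, not actual Ignite driver code:

```java
// A sketch of collapsing N executions of a single-row INSERT template into
// one multi-row INSERT. The driver would also flatten the N parameter sets
// into a single argument array, in the same order as the repeated rows.
public class BatchRewrite {
    /** Repeats the VALUES row list of a single-row INSERT batchSize times. */
    public static String rewrite(String singleRowInsert, int batchSize) {
        // Locate the VALUES keyword: everything after it is the row template.
        int idx = singleRowInsert.toUpperCase().lastIndexOf("VALUES");
        String head = singleRowInsert.substring(0, idx + "VALUES".length());
        String row = singleRowInsert.substring(idx + "VALUES".length()).trim();

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < batchSize; i++)
            sb.append(i == 0 ? " " : ", ").append(row);

        return sb.toString();
    }
}
```

With this approach a batch of three executions of `INSERT INTO t VALUES (?, ?)` goes to the cluster as the single statement `INSERT INTO t VALUES (?, ?), (?, ?), (?, ?)`, so only one JdbcQueryTask is sent.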
If we use Streamer, then we always have `happens-before` broken. This is
ok, because the Streamer is for data loading, not for regular operations. We are not inventing any bicycles, just separating concerns: Batching and Streaming. My point here is that they should not depend on each other at all: Batching can work with or without Streaming, and Streaming can work with or without Batching. Your proposal is a set of non-obvious rules for how they interact. I see no reason for these complications. Sergi 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <[hidden email]>: |
Gents,
As Sergi suggested, batching and streaming are very different semantically. To use standard JDBC batching, all we need to do is convert it to a cache.putAll() call, as semantically a putAll(...) call is identical to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in between, then we may have to break the batch into several chunks and execute the update in between. The DataStreamer should not be used here. I believe that for streaming we need to add a special JDBC/ODBC connection flag. Whenever this flag is set to true, we should only allow INSERT or single-UPDATE operations and use the DataStreamer API internally. All operations other than INSERT or single-UPDATE should be prohibited. I think this design is semantically clear. Any objections? D. On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <[hidden email]> wrote: |
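Dmitriy's chunking rule above - consecutive INSERTs become one putAll-sized chunk, and any other DML statement breaks the batch so ordering is preserved - can be sketched roughly as follows. The statement classification and names are simplified illustrations, not Ignite code:

```java
import java.util.ArrayList;
import java.util.List;

// Walk a JDBC batch in order: group consecutive INSERTs into one chunk
// (a candidate for a single cache.putAll()), and emit any other statement
// (e.g. UPDATE ... WHERE) as its own single-element chunk, so the overall
// execution order of the batch is kept intact.
public class BatchChunker {
    public static List<List<String>> chunk(List<String> statements) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();

        for (String stmt : statements) {
            if (stmt.trim().toUpperCase().startsWith("INSERT")) {
                current.add(stmt);
            }
            else {
                // Flush the pending INSERT run before the interleaved update.
                if (!current.isEmpty()) {
                    chunks.add(current);
                    current = new ArrayList<>();
                }
                chunks.add(List.of(stmt));
            }
        }

        if (!current.isEmpty())
            chunks.add(current);

        return chunks;
    }
}
```

So a batch of [INSERT, INSERT, UPDATE, INSERT] yields three chunks: two INSERTs for one putAll, the UPDATE alone, then the trailing INSERT.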
I have already expressed my concern - this is a counterintuitive approach, because without happens-before a pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. And my question still stands - what products, except possibly Ignite, do this kind of JDBC streaming? Any example? Another problem is that a connection-wide property doesn't fit well in the JDBC pooling model. On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <[hidden email]> wrote:
-- Vladimir Ozerov Senior Software Architect GridGain Systems www.gridgain.com *+7 (960) 283 98 40* |
In reply to this post by dsetrakyan
I already expressed my concern - this is a counterintuitive approach, because without happens-before a pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. My question still stands - what products, except possibly Ignite, do this kind of JDBC streaming?

Another problem is that a connection-wide property doesn't fit well into the JDBC pooling model. Users will have to use different connections for streaming and non-streaming approaches.

Please see how Oracle did it; this is precisely what I am talking about: https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232 There are two batching modes - one with explicit flush and another with implicit flush, where Oracle decides on its own when it is best to communicate with the server. The batching mode can be declared globally or at the per-statement level. Simple and flexible.

On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <[hidden email]> wrote:
> As Sergi suggested, batching and streaming are very different semantically.
>
> To use standard JDBC batching, all we need to do is convert it to a
> cache.putAll() method, as semantically a putAll(...) call is identical to a
> JDBC batch.
> [...]
> I believe that for streaming we need to add a special JDBC/ODBC connection
> flag. Whenever this flag is set to true, then we should only allow INSERT
> or single-UPDATE operations and use the DataStreamer API internally.
> [...]

-- Vladimir Ozerov Senior Software Architect GridGain Systems www.gridgain.com *+7 (960) 283 98 40* |
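The explicit vs. implicit flush distinction referenced above can be sketched in plain Java. This is a toy model only - the `Batcher` class, its threshold, and the round-trip counter are invented for illustration and are not Oracle, JDBC, or Ignite APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the two batching modes: explicit flush
// (standard addBatch/executeBatch) vs. implicit flush (the driver
// sends the batch once it grows past a size threshold).
public class Batcher {
    private final List<Object[]> buffer = new ArrayList<>();
    private final int implicitFlushSize;   // 0 = explicit mode only
    private int roundTrips;                // simulated network hops

    public Batcher(int implicitFlushSize) {
        this.implicitFlushSize = implicitFlushSize;
    }

    // Analogue of PreparedStatement.addBatch(): buffer one parameter set.
    public void addBatch(Object... params) {
        buffer.add(params);
        if (implicitFlushSize > 0 && buffer.size() >= implicitFlushSize)
            flush();
    }

    // Analogue of executeBatch(): send everything buffered in one hop.
    public int flush() {
        if (buffer.isEmpty())
            return 0;
        int sent = buffer.size();
        buffer.clear();
        roundTrips++;              // one network interaction per flush
        return sent;
    }

    public int roundTrips() {
        return roundTrips;
    }

    public static void main(String[] args) {
        Batcher explicitMode = new Batcher(0);
        for (int i = 0; i < 10; i++)
            explicitMode.addBatch(i, "name-" + i);
        explicitMode.flush();
        // All 10 rows went out in a single simulated hop.
        System.out.println("explicit round trips: " + explicitMode.roundTrips());

        Batcher implicitMode = new Batcher(4);
        for (int i = 0; i < 10; i++)
            implicitMode.addBatch(i, "name-" + i);
        implicitMode.flush();      // final partial batch
        System.out.println("implicit round trips: " + implicitMode.roundTrips());
    }
}
```

With 10 rows, the explicit mode makes one hop on flush, while the implicit mode with a threshold of 4 makes three (two automatic, one final). The trade-off Vladimir describes is exactly this: who decides when the buffer crosses the wire.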
On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[hidden email]> wrote:
> I already expressed my concern - this is a counterintuitive approach,
> because without happens-before a pure streaming model can be applied only
> to independent chunks of data. [...] My question still stands - what
> products, except possibly Ignite, do this kind of JDBC streaming?

Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and DataStreamer.addData().

JDBC batching and putAll() are absolutely identical. If you see it as counterintuitive, I would ask for a concrete example.

As far as links between data go, Ignite does not have foreign-key constraints, so DataStreamer can insert data in any order (but again, not as part of a JDBC batch).

> Another problem is that a connection-wide property doesn't fit well into
> the JDBC pooling model. Users will have to use different connections for
> streaming and non-streaming approaches.

Using DataStreamer is not possible within the JDBC batching paradigm, period. I wish we could drop the high-level-feels-good discussions altogether, because it seems like we are spinning wheels here.

There is no way to use the streamer in a JDBC context unless we add a connection flag. Again, if you disagree, I would prefer to see a concrete example explaining why.

> Please see how Oracle did it; this is precisely what I am talking about:
> https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
> [...]
|
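The "JDBC batch is identical to putAll()" equivalence can be modeled in a few lines of plain Java. This is a hypothetical sketch - a `LinkedHashMap` stands in for `IgniteCache`, and `BatchAsPutAll` and its methods are invented for illustration, not real driver code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Model of "JDBC batch == cache.putAll()": batched INSERT parameter
// sets are collected client-side and applied synchronously in one
// shot, so reads issued afterwards in the same thread see the data.
public class BatchAsPutAll {
    private final Map<Integer, String> cache = new LinkedHashMap<>();   // stand-in for IgniteCache
    private final Map<Integer, String> pending = new LinkedHashMap<>(); // buffered parameter sets

    // Analogue of addBatch() on "INSERT INTO t VALUES (?, ?)".
    public void addBatch(int key, String val) {
        pending.put(key, val);
    }

    // Analogue of executeBatch(): one synchronous putAll, then clear.
    public int executeBatch() {
        int n = pending.size();
        cache.putAll(pending);   // the single "network hop" in this model
        pending.clear();
        return n;
    }

    // Analogue of "SELECT ... WHERE key = ?" issued after the batch.
    public String select(int key) {
        return cache.get(key);
    }

    public static void main(String[] args) {
        BatchAsPutAll b = new BatchAsPutAll();
        b.addBatch(1, "one");
        b.addBatch(2, "two");
        b.executeBatch();
        // Happens-before holds: the inserted value is immediately visible.
        System.out.println(b.select(1));
    }
}
```

The synchronous putAll is what makes the batch safe to follow with a SELECT in the same thread; an asynchronous streamer underneath would break exactly this guarantee, which is the crux of the disagreement in the thread.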
Dima,
I would like to point out that data streamer support had already been implemented in the course of the work on DML in 1.8, exactly as you are suggesting now (turned on via a connection flag; allowed only for MERGE - the data streamer can't do putIfAbsent stuff, right? - and with absolutely no relation to JDBC), *but* that patch was reverted - on advice from Vlad, which I believe had been agreed with you - so it didn't make it into 1.8 after all.

Also, while it's possible to maintain INSERT vs MERGE semantics using the streamer's allowOverwrite flag, I can't see how we could mimic UPDATE here: the streamer skips a put only in the case when the key is present AND allowOverwrite is false, while UPDATE should not put anything when the key is *missing* - i.e., there's no way to emulate the cache's *replace* operation semantics with the streamer (update the value only if the key is present, otherwise do nothing).

- Alex

On Dec 9, 2016, 10:00 PM, "Dmitriy Setrakyan" <[hidden email]> wrote:
> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[hidden email]> wrote:
> > [...]
>
> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and
> DataStreamer.addData(). JDBC batching and putAll() are absolutely
> identical.
> [...]
> > *IgniteDataStreamer* > > > > can > > > > > > be > > > > > > > used > > > > > > > >>> underneath. > > > > > > > >>> > > > > > > > >>> 2) Or we can have separate connection flag which will move > > all > > > > > > > >>> INSERT/UPDATE/DELETE statements through streamer. > > > > > > > >>> > > > > > > > >>> I prefer the first approach > > > > > > > >>> > > > > > > > >>> Also we need to keep in mind that data streamer has poor > > > > > performance > > > > > > > when > > > > > > > >>> adding single key-value pairs due to high overhead on > > > concurrency > > > > > and > > > > > > > other > > > > > > > >>> bookkeeping. Instead, it is better to pre-batch key-value > > pairs > > > > > > before > > > > > > > >>> giving them to streamer. > > > > > > > >>> > > > > > > > >>> Vladimir. > > > > > > > >>> > > > > > > > >>> [1] > > > > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/ > > > > > > > PreparedStatement.html# > > > > > > > >>> addBatch-- > > > > > > > >>> > > > > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko < > > > > > > > >>> [hidden email]> wrote: > > > > > > > >>> > > > > > > > >>> > Hello Igniters, > > > > > > > >>> > > > > > > > > >>> > One of the major improvements to DML has to be support of > > > batch > > > > > > > >>> > statements. I'd like to discuss its implementation. The > > > > suggested > > > > > > > >>> > approach is to rewrite given query turning it from few > > > INSERTs > > > > > into > > > > > > > >>> > single statement and processing arguments accordingly. I > > > > suggest > > > > > > this > > > > > > > >>> > as long as the whole point of batching is to make as > little > > > > > > > >>> > interactions with cluster as possible and to make > > operations > > > as > > > > > > > >>> > condensed as possible, and in case of Ignite it means > that > > we > > > > > > should > > > > > > > >>> > send as little JdbcQueryTasks as possible. 
And, as long > as > > a > > > > > query > > > > > > > >>> > task holds single query and its arguments, this approach > > will > > > > not > > > > > > > >>> > require any changes to be done to current design and > won't > > > > break > > > > > > any > > > > > > > >>> > backward compatibility - all dirty work on rewriting will > > be > > > > done > > > > > > by > > > > > > > >>> > JDBC driver. > > > > > > > >>> > Without rewriting, we could introduce some new query task > > for > > > > > batch > > > > > > > >>> > operations, but that would make impossible sending such > > > > requests > > > > > > from > > > > > > > >>> > newer clients to older servers (say, servers of version > > > 1.8.0, > > > > > > which > > > > > > > >>> > does not know about batching, let alone older versions). > > > > > > > >>> > I'd like to hear comments and suggestions from the > > community. > > > > > > Thanks! > > > > > > > >>> > > > > > > > > >>> > - Alex > > > > > > > >>> > > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Vladimir Ozerov > > Senior Software Architect > > GridGain Systems > > www.gridgain.com > > *+7 (960) 283 98 40* > > > |
Sorry, "no relation w/JDBC" in my previous message should read "no relation
w/JDBC batching".

— Alex

On Dec 10, 2016, 1:52 PM, "Alexander Paschenko" <[hidden email]> wrote:

> Dima,
>
> I would like to point out that data streamer support had already been
> implemented in the course of work on DML in 1.8 exactly as you are
> suggesting now (turned on via a connection flag; only MERGE allowed — data
> streamer can't do putIfAbsent stuff, right?; absolutely no relation
> w/JDBC), *but* that patch was reverted — on advice from Vlad which I
> believe had been agreed with you — so it didn't make it into 1.8 after all.
>
> Also, while it's possible to maintain INSERT vs MERGE semantics using the
> streamer's allowOverwrite flag, I can't see how we could mimic UPDATE
> here: the only conditional behavior the streamer has is to skip a put when
> the key is already present AND allowOverwrite is false, while UPDATE must
> put nothing when the key is *missing* — i.e., there is no way to emulate
> the semantics of the cache's *replace* operation with the streamer (update
> the value only if the key is present, otherwise do nothing).
>
> — Alex
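Alex's objection (that neither setting of the streamer's allowOverwrite flag yields UPDATE semantics) can be sketched with a toy model. A plain HashMap stands in for the Ignite cache here, and all class and method names below are illustrative rather than Ignite API:

```java
import java.util.HashMap;
import java.util.Map;

public class StreamerSemantics {

    // allowOverwrite = true behaves like SQL MERGE: unconditional put.
    static void mergeLike(Map<String, Integer> cache, String key, int val) {
        cache.put(key, val);
    }

    // allowOverwrite = false behaves like SQL INSERT: put only if absent.
    static void insertLike(Map<String, Integer> cache, String key, int val) {
        cache.putIfAbsent(key, val);
    }

    // SQL UPDATE needs replace semantics: write only if the key exists.
    // Neither allowOverwrite setting gives this behavior.
    static void updateLike(Map<String, Integer> cache, String key, int val) {
        cache.replace(key, val);
    }

    public static void main(String[] args) {
        Map<String, Integer> cache = new HashMap<>();
        cache.put("a", 1);

        mergeLike(cache, "b", 2);    // new key inserted
        insertLike(cache, "a", 100); // no-op: key already present
        updateLike(cache, "c", 3);   // no-op: key missing
        updateLike(cache, "a", 10);  // existing key overwritten

        System.out.println(cache);   // prints {a=10, b=2}
    }
}
```

In cache terms the three statements map to put, putIfAbsent, and replace; the streamer's two allowOverwrite modes cover only the first two, which is exactly the gap Alex describes.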
Alex,

It seems to me that replace semantics can be implemented with a StreamReceiver, no?

D.

On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <[hidden email]> wrote:

> Sorry, "no relation w/JDBC" in my previous message should read "no relation
> w/JDBC batching".
>
> — Alex
And, as > long > >> as > >> > a > >> > > > > query > >> > > > > > > >>> > task holds single query and its arguments, this > approach > >> > will > >> > > > not > >> > > > > > > >>> > require any changes to be done to current design and > >> won't > >> > > > break > >> > > > > > any > >> > > > > > > >>> > backward compatibility - all dirty work on rewriting > >> will > >> > be > >> > > > done > >> > > > > > by > >> > > > > > > >>> > JDBC driver. > >> > > > > > > >>> > Without rewriting, we could introduce some new query > >> task > >> > for > >> > > > > batch > >> > > > > > > >>> > operations, but that would make impossible sending > such > >> > > > requests > >> > > > > > from > >> > > > > > > >>> > newer clients to older servers (say, servers of > version > >> > > 1.8.0, > >> > > > > > which > >> > > > > > > >>> > does not know about batching, let alone older > versions). > >> > > > > > > >>> > I'd like to hear comments and suggestions from the > >> > community. > >> > > > > > Thanks! > >> > > > > > > >>> > > >> > > > > > > >>> > - Alex > >> > > > > > > >>> > > >> > > > > > > >>> > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > >> > > >> > -- > >> > Vladimir Ozerov > >> > Senior Software Architect > >> > GridGain Systems > >> > www.gridgain.com > >> > *+7 (960) 283 98 40* > >> > > >> > > > |
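The query rewriting Alexander proposes at the top of the thread (and which he notes MySQL's driver already does for batches) can be sketched in a few lines: collapse N queued parameter sets for a single-row INSERT into one multi-row INSERT, so the driver sends one JdbcQueryTask instead of N. The class and method names below are hypothetical illustration, not Ignite code.

```java
/**
 * Hypothetical sketch of batch rewriting: turn N addBatch() parameter sets
 * for "INSERT INTO t VALUES (...)" into a single multi-row INSERT statement.
 */
public class BatchRewriter {
    /**
     * @param singleRowInsert e.g. "INSERT INTO person VALUES (?, ?)"
     * @param batchSize       number of parameter sets queued via addBatch()
     * @return rewritten SQL with one VALUES row group per parameter set
     */
    public static String rewrite(String singleRowInsert, int batchSize) {
        int valuesIdx = singleRowInsert.toUpperCase().indexOf("VALUES");
        if (valuesIdx < 0 || batchSize < 1)
            throw new IllegalArgumentException("Not a rewritable INSERT");

        // Split the statement into the part up to VALUES and the row group.
        String head = singleRowInsert.substring(0, valuesIdx + "VALUES".length());
        String rowGroup = singleRowInsert.substring(valuesIdx + "VALUES".length()).trim();

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < batchSize; i++)
            sb.append(i == 0 ? " " : ", ").append(rowGroup);

        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("INSERT INTO person VALUES (?, ?)", 3));
        // INSERT INTO person VALUES (?, ?), (?, ?), (?, ?)
    }
}
```

The flattened argument list would then be bound to the rewritten statement's placeholders in order, which is why this approach needs no protocol changes between client and server.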
OK folks, both data streamer support and batching support have been implemented.
Resulting design fully conforms to what Dima suggested initially: the two concepts are separated.

Streamed statements are turned on by a connection flag, and the stream auto-flush timeout can be tuned the same way. These statements support INSERT and MERGE without a subquery, as well as fast key-bounded DELETE and UPDATE. Each prepared statement in streamed mode has its own streamer object, and their lifecycles coincide: on close, the statement closes its streamer. Streaming mode is available only in the "local" mode of connection between the JDBC driver and the Ignite client (the default mode, when the JDBC driver creates an Ignite client node by itself) - there would be no sense in streaming if query arguments had to travel over the network.

Batched statements are used via the conventional JDBC API (setArgs... addBatch... executeBatch...). They also support INSERT and MERGE without a subquery, as well as fast key- (and, optionally, value-) bounded DELETE and UPDATE. These work in a similar manner to non-batched statements and likewise rely on the traditional putAll/invokeAll routines. Essentially, batching is just a way to pass a bigger map to cache.putAll without writing a single very long query. This works in the local as well as the "remote" Ignite JDBC connectivity mode.

More info (details are in the comments):

Batching - https://issues.apache.org/jira/browse/IGNITE-4269
Streaming - https://issues.apache.org/jira/browse/IGNITE-4169

Regards,
Alex

2016-12-10 23:39 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:
> Alex,
>
> It seems to me that replace semantics can be implemented with a StreamReceiver, no?
>
> D.
>
> On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <[hidden email]> wrote:
>> Sorry, "no relation w/JDBC" in my previous message should read "no relation w/JDBC batching".
>>
>> — Alex
>>
>> On Dec 10, 2016, at 1:52 PM, Alexander Paschenko <[hidden email]> wrote:
>>> Dima,
>>>
>>> I would like to point out that data streamer support had already been implemented in the course of the work on DML in 1.8 exactly as you are suggesting now (turned on via a connection flag; only MERGE allowed — the data streamer can't do putIfAbsent stuff, right?; absolutely no relation w/JDBC), *but* that patch was reverted — on advice from Vlad, which I believe had been agreed with you, so it didn't make it into 1.8 after all. Also, while it's possible to maintain the INSERT vs MERGE semantics using the streamer's allowOverwrite flag, I can't see how we could mimic UPDATE here: the streamer skips putting data to the cache only when the key is present AND allowOverwrite is false, while UPDATE should not put anything when the key is *missing* — i.e., there's no way to emulate the semantics of the cache's *replace* operation with the streamer (update the value only if the key is present, otherwise do nothing).
>>>
>>> — Alex
>>>
>>> On Dec 9, 2016, at 10:00 PM, Dmitriy Setrakyan <[hidden email]> wrote:
>>>> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[hidden email]> wrote:
>>>>> I already expressed my concern - this is a counterintuitive approach, because without happens-before the pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. My question still stands - what products, except possibly Ignite, do this kind of JDBC streaming?
>>>>
>>>> Vova, we have two mechanisms in the product: IgniteCache.putAll() and DataStreamer.addData().
>>>>
>>>> JDBC batching and putAll() are absolutely identical. If you see it as counterintuitive, I would ask for a concrete example.
>>>>
>>>> As far as links between data go, Ignite does not have foreign-key constraints, so the DataStreamer can insert data in any order (but again, not as part of a JDBC batch).
>>>>
>>>>> Another problem is that a connection-wide property doesn't fit well into the JDBC pooling model. Users will have to use different connections for streaming and non-streaming approaches.
>>>>
>>>> Using the DataStreamer is not possible within the JDBC batching paradigm, period. I wish we could drop the high-level-feels-good discussions altogether, because it seems like we are spinning wheels here.
>>>>
>>>> There is no way to use the streamer in a JDBC context unless we add a connection flag. Again, if you disagree, I would prefer to see a concrete example explaining why.
>>>>
>>>>> Please see how Oracle did that; this is precisely what I am talking about: https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
>>>>> Two batching modes - one with explicit flush, another with implicit flush, where Oracle decides on its own when it is better to communicate with the server. The batching mode can be declared globally or on a per-statement level. Simple and flexible.
>>>>>
>>>>> On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <[hidden email]> wrote:
>>>>>> Gents,
>>>>>>
>>>>>> As Sergi suggested, batching and streaming are very different semantically.
>>>>>>
>>>>>> To use standard JDBC batching, all we need to do is convert it to a cache.putAll() method, as semantically a putAll(...) call is identical to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in between, then we may have to break the batch into several chunks and execute the update in between. The DataStreamer should not be used here.
>>>>>>
>>>>>> I believe that for streaming we need to add a special JDBC/ODBC connection flag. Whenever this flag is set to true, we should only allow INSERT or single-UPDATE operations and use the DataStreamer API internally. All operations other than INSERT or single-UPDATE should be prohibited.
>>>>>>
>>>>>> I think this design is semantically clear. Any objections?
>>>>>>
>>>>>> D.
>>>>>>
>>>>>> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <[hidden email]> wrote:
>>>>>>> If we use the Streamer, then we always have `happens-before` broken. This is OK, because the Streamer is for data loading, not for usual operation.
>>>>>>>
>>>>>>> We are not inventing any bicycles, just separating concerns: Batching and Streaming.
>>>>>>>
>>>>>>> My point here is that they should not depend on each other at all: Batching can work with or without Streaming, just as Streaming can work with or without Batching.
>>>>>>>
>>>>>>> Your proposal is a set of non-obvious rules for them to work together. I see no reason for these complications.
>>>>>>>
>>>>>>> Sergi
>>>>>>>
>>>>>>> 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <[hidden email]>:
>>>>>>>> Sergi,
>>>>>>>>
>>>>>>>> If the user calls a single *execute()* operation, then most likely it is not batching. We should not rely on the strange case where a user performs batching without using the standard and well-adopted JDBC batching API. The main problem with the streamer is that it is async and hence breaks happens-before guarantees in a single thread: a SELECT after an INSERT might not return the inserted value.
>>>>>>>>
>>>>>>>> Honestly, I do not really understand why we are trying to re-invent a bicycle here. There is a standard API - let's just use it and make it flexible enough to take advantage of IgniteDataStreamer if needed.
>>>>>>>>
>>>>>>>> Is there any use case which is not covered by this solution? Or let me ask from the opposite side - are there any well-known JDBC drivers which perform batching/streaming from non-batched update statements?
>>>>>>>>
>>>>>>>> Vladimir.
>>>>>>>>
>>>>>>>> On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <[hidden email]> wrote:
>>>>>>>>> Vladimir,
>>>>>>>>>
>>>>>>>>> I see no reason to forbid Streamer usage from non-batched statement execution. It is common that users already have their ETL tools, and you can't be sure whether they use batching or not.
>>>>>>>>>
>>>>>>>>> Alex,
>>>>>>>>>
>>>>>>>>> I guess we have to decide on Streaming first and then discuss Batching separately, OK? Because this decision may become important for the batching implementation.
>>>>>>>>>
>>>>>>>>> Sergi
|
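Alexander's point that "batching is just a way to pass a bigger map to cache.putAll" can be simulated with plain JDK types; a plain Map stands in for IgniteCache here, so none of Ignite's actual API is assumed, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of JDBC batching mapped onto a single putAll(): each addBatch()
 * buffers one key/value pair extracted from the INSERT arguments, and
 * executeBatch() hands the whole buffer over in one bulk operation.
 * A plain Map stands in for IgniteCache; this illustrates the semantics
 * discussed in the thread, not Ignite's actual implementation.
 */
public class PutAllBatch<K, V> {
    private final Map<K, V> buffer = new LinkedHashMap<>(); // preserves insertion order
    private final Map<K, V> cache;                          // stand-in for IgniteCache

    public PutAllBatch(Map<K, V> cache) {
        this.cache = cache;
    }

    /** Analogous to setArgs(...) followed by addBatch(). */
    public void addBatch(K key, V val) {
        buffer.put(key, val);
    }

    /** Analogous to executeBatch(): one bulk operation instead of N single puts. */
    public int executeBatch() {
        int size = buffer.size();
        cache.putAll(buffer); // the single "network hop" of the batch
        buffer.clear();
        return size;
    }
}
```

Nothing reaches the cache until executeBatch() is called, which is exactly the happens-before property that distinguishes this from the asynchronous streamer path.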
Alexander,
A couple of comments in regard to the streaming mode. I would rename the existing property to "ignite.jdbc.streaming" and add additional ones that will help to manage and tune the streaming behavior:

ignite.jdbc.streaming.perNodeBufferSize
ignite.jdbc.streaming.perNodeParallelOperations
ignite.jdbc.streaming.autoFlushFrequency

Any other thoughts?

— Denis
|
>>>>>>>>>>>>> >>>>>>>>>>>>> As I already said they can work together: Batching will >>>>> batch >>>>>>>>>> parameters >>>>>>>>>>>>> and on `execute` they will go to the Streamer in one shot >>>>> and >>>>>>>>> Streamer >>>>>>>>>>> will >>>>>>>>>>>>> deal with the rest. >>>>>>>>>>>>> >>>>>>>>>>>>> Sergi >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov < >>>>>>> [hidden email] >>>>>>>>> : >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Alex, >>>>>>>>>>>>>> >>>>>>>>>>>>>> To my understanding there are two possible approaches to >>>>>>> batching >>>>>>>>> in >>>>>>>>>>> JDBC >>>>>>>>>>>>>> layer: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1) Rely on default batching API. Specifically >>>>>>>>>>>>>> *PreparedStatement.addBatch()* [1] >>>>>>>>>>>>>> and others. This is nice and clear API, users are used >>> to >>>>> it, >>>>>>> and >>>>>>>>>> it's >>>>>>>>>>>>>> adoption will minimize user code changes when migrating >>>>> from >>>>>>>> other >>>>>>>>>> JDBC >>>>>>>>>>>>>> sources. We simply copy updates locally and then execute >>>>> them >>>>>>> all >>>>>>>>> at >>>>>>>>>>> once >>>>>>>>>>>>>> with only a single network hop to servers. >>>>>> *IgniteDataStreamer* >>>>>>>> can >>>>>>>>>> be >>>>>>>>>>> used >>>>>>>>>>>>>> underneath. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2) Or we can have separate connection flag which will >>> move >>>>>> all >>>>>>>>>>>>>> INSERT/UPDATE/DELETE statements through streamer. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I prefer the first approach >>>>>>>>>>>>>> >>>>>>>>>>>>>> Also we need to keep in mind that data streamer has poor >>>>>>>>> performance >>>>>>>>>>> when >>>>>>>>>>>>>> adding single key-value pairs due to high overhead on >>>>>>> concurrency >>>>>>>>> and >>>>>>>>>>> other >>>>>>>>>>>>>> bookkeeping. Instead, it is better to pre-batch >>> key-value >>>>>> pairs >>>>>>>>>> before >>>>>>>>>>>>>> giving them to streamer. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Vladimir. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> https://docs.oracle.com/javase/8/docs/api/java/sql/ >>>>>>>>>>> PreparedStatement.html# >>>>>>>>>>>>>> addBatch-- >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko < >>>>>>>>>>>>>> [hidden email]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hello Igniters, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> One of the major improvements to DML has to be support >>>>> of >>>>>>> batch >>>>>>>>>>>>>>> statements. I'd like to discuss its implementation. >>> The >>>>>>>> suggested >>>>>>>>>>>>>>> approach is to rewrite given query turning it from few >>>>>>> INSERTs >>>>>>>>> into >>>>>>>>>>>>>>> single statement and processing arguments >>> accordingly. I >>>>>>>> suggest >>>>>>>>>> this >>>>>>>>>>>>>>> as long as the whole point of batching is to make as >>>>> little >>>>>>>>>>>>>>> interactions with cluster as possible and to make >>>>>> operations >>>>>>> as >>>>>>>>>>>>>>> condensed as possible, and in case of Ignite it means >>>>> that >>>>>> we >>>>>>>>>> should >>>>>>>>>>>>>>> send as little JdbcQueryTasks as possible. And, as >>> long >>>>> as >>>>>> a >>>>>>>>> query >>>>>>>>>>>>>>> task holds single query and its arguments, this >>> approach >>>>>> will >>>>>>>> not >>>>>>>>>>>>>>> require any changes to be done to current design and >>>>> won't >>>>>>>> break >>>>>>>>>> any >>>>>>>>>>>>>>> backward compatibility - all dirty work on rewriting >>>>> will >>>>>> be >>>>>>>> done >>>>>>>>>> by >>>>>>>>>>>>>>> JDBC driver. >>>>>>>>>>>>>>> Without rewriting, we could introduce some new query >>>>> task >>>>>> for >>>>>>>>> batch >>>>>>>>>>>>>>> operations, but that would make impossible sending >>> such >>>>>>>> requests >>>>>>>>>> from >>>>>>>>>>>>>>> newer clients to older servers (say, servers of >>> version >>>>>>> 1.8.0, >>>>>>>>>> which >>>>>>>>>>>>>>> does not know about batching, let alone older >>> versions). >>>>>>>>>>>>>>> I'd like to hear comments and suggestions from the >>>>>> community. 
>>>>>>>>>> Thanks! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Alex >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Vladimir Ozerov >>>>>> Senior Software Architect >>>>>> GridGain Systems >>>>>> www.gridgain.com >>>>>> *+7 (960) 283 98 40* >>>>>> >>>>> >>>> >>> |
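As Alexander notes above, one possible batching strategy is the one MySQL's driver uses with `rewriteBatchedStatements`: fold N batched parameter sets for a single-row INSERT into one multi-row INSERT, so that only one statement (and hence, in Ignite's case, one JdbcQueryTask) travels to the cluster. A minimal standalone sketch of that rewriting step; the `BatchRewrite` class and its method are illustrative only, not part of any Ignite or JDBC API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of the query-rewriting idea discussed in this thread: N batched
 * parameter sets for "INSERT INTO t (...) VALUES (?, ...)" are folded into
 * a single multi-row INSERT plus one flattened argument list.
 */
public class BatchRewrite {
    /**
     * @param sqlPrefix       statement text up to (not including) VALUES.
     * @param rowPlaceholders placeholder group for one row, e.g. "(?, ?)".
     * @param batchedArgs     one Object[] of arguments per batched row.
     * @return { rewritten SQL string, flattened List of arguments }.
     */
    public static Object[] rewrite(String sqlPrefix, String rowPlaceholders,
                                   List<Object[]> batchedArgs) {
        StringBuilder sql = new StringBuilder(sqlPrefix);
        List<Object> flatArgs = new ArrayList<>();

        for (int i = 0; i < batchedArgs.size(); i++) {
            // First row gets " VALUES ", subsequent rows are comma-separated.
            sql.append(i == 0 ? " VALUES " : ", ").append(rowPlaceholders);
            flatArgs.addAll(Arrays.asList(batchedArgs.get(i)));
        }

        return new Object[] { sql.toString(), flatArgs };
    }

    public static void main(String[] args) {
        List<Object[]> batch = new ArrayList<>();
        batch.add(new Object[] { 1, "Ann" });
        batch.add(new Object[] { 2, "Bob" });

        Object[] res = rewrite("INSERT INTO person (id, name)", "(?, ?)", batch);

        // One statement, one argument array - hence a single query task.
        System.out.println(res[0]); // INSERT INTO person (id, name) VALUES (?, ?), (?, ?)
        System.out.println(res[1]); // [1, Ann, 2, Bob]
    }
}
```

This captures why the approach needs no protocol changes: the driver still sends a single query string with a single argument list, exactly what the existing JdbcQueryTask already carries.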
Auto flush freq is already there, I just forgot to mention it in the comments. Will add the rest today.

— Alex

On Dec 19, 2016 at 10:29 PM, "Denis Magda" <[hidden email]> wrote:

Alexander,

A couple of comments in regards to the streaming mode.

I would rename the existing property to "ignite.jdbc.streaming" and add additional ones that will help to manage and tune the streaming behavior:
ignite.jdbc.streaming.perNodeBufferSize
ignite.jdbc.streaming.perNodeParallelOperations
ignite.jdbc.streaming.autoFlushFrequency

Any other thoughts?

— Denis

On Dec 19, 2016, at 8:02 AM, Alexander Paschenko <[hidden email]> wrote:

OK folks, both data streamer support and batching support have been implemented.

The resulting design fully conforms to what Dima suggested initially - these two concepts are separated.

Streamed statements are turned on by a connection flag, and the stream auto flush timeout can be tuned in the same way. These statements support INSERT and MERGE w/o subquery as well as fast key-bounded DELETE and UPDATE. Each prepared statement in streamed mode has its own streamer object and their lifecycles are the same - on close, the statement closes its streamer. Streaming mode is available only in the "local" mode of connection between the JDBC driver and the Ignite client (the default mode, when the JDBC driver creates the Ignite client node by itself) - there would be no sense in streaming if query args had to travel over the network.

Batched statements are used via the conventional JDBC API (setArgs... addBatch... executeBatch...); they also support INSERT and MERGE w/o subquery as well as fast key (and, optionally, value) bounded DELETE and UPDATE. These work in a similar manner to non-batched statements and likewise rely on traditional putAll/invokeAll routines. Essentially, batching is just a way to pass a bigger map to cache.putAll without writing a single very long query. This works in local as well as "remote" Ignite JDBC connectivity mode.

More info (details are in the comments):

Batching - https://issues.apache.org/jira/browse/IGNITE-4269
Streaming - https://issues.apache.org/jira/browse/IGNITE-4169

Regards,
Alex

2016-12-10 23:39 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:

Alex,

It seems to me that replace semantics can be implemented with a StreamReceiver, no?

D.

On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <[hidden email]> wrote:

Sorry, "no relation w/JDBC" in my previous message should read "no relation w/JDBC batching".

— Alex

On Dec 10, 2016 at 1:52 PM, "Alexander Paschenko" <[hidden email]> wrote:

Dima,

I would like to point out that data streamer support had already been implemented in the course of the work on DML in 1.8 exactly as you are suggesting now (turned on via connection flag; allowed only MERGE — the data streamer can't do putIfAbsent stuff, right?; absolutely no relation w/JDBC), *but* that patch had been reverted — by advice from Vlad, which I believe had been agreed with you, so it didn't make it to 1.8 after all. Also, while it's possible to maintain INSERT vs MERGE semantics using the streamer's allowOverwrite flag, I can't see how we could mimic UPDATE here: the streamer skips the put only in the case when the key is present AND allowOverwrite is false, while UPDATE should not put anything when the key is *missing* — i.e., there's no way to emulate the cache's *replace* operation semantics with the streamer (update the value only if the key is present, otherwise do nothing).

— Alex

On Dec 9, 2016 at 10:00 PM, "Dmitriy Setrakyan" <[hidden email]> wrote:

On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[hidden email]> wrote:

> I already expressed my concern - this is a counterintuitive approach, because without happens-before the pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. My question still stands - what products, except possibly Ignite, do this kind of JDBC streaming?

Vova, we have 2 mechanisms in the product: IgniteCache.putAll() or DataStreamer.addData().

JDBC batching and putAll() are absolutely identical. If you see it as counterintuitive, I would ask for a concrete example.

As far as links between data: Ignite does not have foreign-key constraints, so the DataStreamer can insert data in any order (but again, not as part of a JDBC batch).

> Another problem is that a connection-wide property doesn't fit well into the JDBC pooling model. Users will have to use different connections for streaming and non-streaming approaches.

Using the DataStreamer is not possible within the JDBC batching paradigm, period. I wish we could drop the high-level-feels-good discussions altogether, because it seems like we are spinning wheels here.

There is no way to use the streamer in a JDBC context unless we add a connection flag. Again, if you disagree, I would prefer to see a concrete example explaining why.

> Please see how Oracle did that, this is precisely what I am talking about: https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
> Two batching modes - one with explicit flush, another one with implicit flush, when Oracle decides on its own when it is better to communicate with the server. The batching mode can be declared globally or on a per-statement level. Simple and flexible.
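Alexander's point about UPDATE semantics can be stated concretely without any Ignite dependency: the streamer with allowOverwrite=false behaves like putIfAbsent (INSERT), with allowOverwrite=true like an unconditional put (MERGE), but UPDATE needs replace — "write only if the key already exists" — which neither streamer mode provides on its own (Dmitriy's suggestion is that a StreamReceiver calling cache.replace() could layer it on). A minimal sketch of the three semantics using a plain ConcurrentMap as a stand-in for the cache:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Stand-in illustration (plain ConcurrentMap, not Ignite) of the cache
 * semantics discussed in the thread: INSERT ~ putIfAbsent,
 * MERGE ~ put, UPDATE ~ replace.
 */
public class UpdateSemantics {
    public static ConcurrentMap<Integer, String> demo() {
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>();
        cache.put(1, "old");

        // INSERT semantics: write only if absent (streamer, allowOverwrite=false).
        cache.putIfAbsent(1, "insert"); // key 1 exists -> no-op
        cache.putIfAbsent(2, "insert"); // key 2 absent -> written

        // MERGE semantics: unconditional write (streamer, allowOverwrite=true).
        cache.put(3, "merge");

        // UPDATE semantics: write only if present - the replace() operation
        // that the streamer cannot express with allowOverwrite alone.
        cache.replace(1, "update");     // key 1 exists -> updated
        cache.replace(4, "update");     // key 4 absent -> no-op

        return cache;
    }

    public static void main(String[] args) {
        ConcurrentMap<Integer, String> cache = demo();
        System.out.println(cache.get(1));        // update
        System.out.println(cache.get(2));        // insert
        System.out.println(cache.get(3));        // merge
        System.out.println(cache.containsKey(4)); // false
    }
}
```

The asymmetry in the last two calls is exactly the gap Alexander describes: allowOverwrite can toggle between the first two behaviors, but nothing in the streamer's flags yields the "present-only" write.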