Hello Igniters,
One of the major improvements to DML is support for batch statements, and I'd like to discuss its implementation. The suggested approach is to rewrite the given query, turning several INSERTs into a single statement and processing the arguments accordingly.

I suggest this because the whole point of batching is to make as few interactions with the cluster as possible and to keep operations as condensed as possible; in the case of Ignite this means we should send as few JdbcQueryTasks as possible. And since a query task holds a single query and its arguments, this approach requires no changes to the current design and won't break any backward compatibility: all the dirty work of rewriting is done by the JDBC driver.

Without rewriting, we could introduce some new query task for batch operations, but that would make it impossible to send such requests from newer clients to older servers (say, servers of version 1.8.0, which know nothing about batching, let alone older versions).

I'd like to hear comments and suggestions from the community. Thanks!

- Alex
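For illustration, a minimal sketch of the rewriting idea (the helper names and the multi-row VALUES strategy are my assumptions, not an actual driver implementation): N buffered parameter sets for a single-row INSERT become one multi-row INSERT with flattened arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any driver: sketches the rewrite
// strategy of collapsing N INSERT executions into one statement.
class BatchRewrite {
    /**
     * Rewrites "INSERT INTO t VALUES (?, ?)" executed with N parameter
     * sets into "INSERT INTO t VALUES (?, ?), (?, ?), ..." so that a
     * single query task carries the whole batch.
     */
    static String rewriteInsert(String sql, int batchSize) {
        int valuesIdx = sql.toUpperCase().indexOf("VALUES");
        if (valuesIdx < 0 || batchSize < 1)
            throw new IllegalArgumentException("Not a rewritable INSERT: " + sql);

        String head = sql.substring(0, valuesIdx + "VALUES".length());
        String row = sql.substring(valuesIdx + "VALUES".length()).trim(); // e.g. "(?, ?)"

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < batchSize; i++)
            sb.append(i == 0 ? " " : ", ").append(row);
        return sb.toString();
    }

    /** Flattens per-row argument arrays into the single argument list of the rewritten query. */
    static List<Object> flattenArgs(List<Object[]> rows) {
        List<Object> flat = new ArrayList<>();
        for (Object[] row : rows)
            for (Object arg : row)
                flat.add(arg);
        return flat;
    }
}
```

The rewritten statement plus flattened arguments then fit the existing one-query-per-task design unchanged.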
Hi Alex,
To my understanding there are two possible approaches to batching in the JDBC layer:

1) Rely on the default batching API, specifically *PreparedStatement.addBatch()* [1] and friends. This is a nice, clear API, users are used to it, and its adoption will minimize user code changes when migrating from other JDBC sources. We simply copy updates locally and then execute them all at once with only a single network hop to the servers. *IgniteDataStreamer* can be used underneath.

2) Or we can have a separate connection flag which routes all INSERT/UPDATE/DELETE statements through the streamer.

I prefer the first approach.

Also, we need to keep in mind that the data streamer has poor performance when adding single key-value pairs, due to high overhead on concurrency and other bookkeeping. Instead, it is better to pre-batch key-value pairs before giving them to the streamer.

Vladimir.

[1] https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
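The standard API Vladimir refers to looks roughly like this in user code (a generic JDBC sketch; the table, column names, and row count are placeholders, and nothing here is Ignite-specific):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class BatchInsertExample {
    /**
     * Standard JDBC batching: buffer parameter sets with addBatch(),
     * then send them all at once with executeBatch().
     */
    static int loadPersons(Connection conn, int rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO person (id, name) VALUES (?, ?)")) {
            for (int i = 0; i < rows; i++) {
                ps.setInt(1, i);
                ps.setString(2, "name-" + i);
                ps.addBatch();                      // buffered locally, nothing sent yet
            }
            return totalUpdates(ps.executeBatch()); // one shot to the server(s)
        }
    }

    /** Sums the per-statement update counts returned by executeBatch(). */
    static int totalUpdates(int[] counts) {
        int total = 0;
        for (int c : counts)
            total += c;
        return total;
    }
}
```

Because nothing leaves the client before `executeBatch()`, a driver is free to ship the whole batch in a single network interaction.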
Vlad,
1. Of course, the API in my view should be the well-known JDBC one: addBatch and friends. The question was not about the API but rather about the implementation. What do you mean by "copy locally and execute all at once"?

2. As I see it, this does not contradict the first approach and could be implemented alongside it as well.

Thanks,
Alex
Guys,
I discussed this feature with Dmitriy, and we came to the conclusion that batching in JDBC and data streaming in Ignite have different semantics and performance characteristics. Thus they are independent features (they may work together or separately, but that is another story). Let me explain.

This is how JDBC batching works:
- Add N sets of parameters to a prepared statement.
- Manually execute the prepared statement.
- Repeat until all the data is loaded.

This is how the data streamer works:
- Keep adding data.
- The streamer buffers the data and loads the buffered per-node batches when they are big enough.
- Close the streamer to make sure that everything is loaded.

As you can see, there is a difference in the semantics of when we send data: if our JDBC driver allows sending batches to nodes without calling `execute` (and we would probably have to make `execute` a no-op here), then we violate JDBC semantics; if we disallow this behavior, then the batching will underperform.

Thus I suggest keeping these features (JDBC batching and JDBC streaming) separate. As I already said, they can work together: batching will collect parameters, on `execute` they will go to the streamer in one shot, and the streamer will deal with the rest.

Sergi
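Sergi's "work together" scheme (buffer parameter sets locally, hand them to the streamer in one shot on execute) can be sketched with a pluggable sink standing in for the streamer; the class and its names are illustrative, not actual Ignite code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only: the sink stands in for IgniteDataStreamer,
// which would receive the whole batch at once on executeBatch().
class JdbcBatch {
    private final List<Object[]> buf = new ArrayList<>();
    private final Consumer<List<Object[]>> sink;

    JdbcBatch(Consumer<List<Object[]>> sink) {
        this.sink = sink;
    }

    /** Corresponds to PreparedStatement.addBatch(): buffer locally, send nothing. */
    void addBatch(Object... params) {
        buf.add(params);
    }

    /** Corresponds to executeBatch(): the buffered parameter sets go out in one shot. */
    int executeBatch() {
        int size = buf.size();
        sink.accept(new ArrayList<>(buf)); // hand everything to the streamer at once
        buf.clear();
        return size;
    }
}
```

With a real streamer the sink would call `addData(...)` per row and then flush; since nothing leaves the client before `executeBatch`, JDBC semantics are preserved.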
Sergi,
JDBC batching might work quite differently from driver to driver. Say, MySQL happily rewrites queries as I suggested at the beginning of this thread (it's not the only strategy, but one of the possible options); and, BTW, I would like to hear at least an opinion about it.

On your first approach, the part before the streamer: you suggest that we send a single statement and multiple parameter sets as a single query task, am I right? (Just to make sure I got you properly.) If so, do you also mean that the API between server and client (namely JdbcQueryTask) should also change? Or should new API means be added to facilitate batching tasks?

- Alex
If we are bothered about performance and violating JDBC rules, then we can easily do the following:

1) Add a boolean flag "*batch_streaming*" to the JDBC connection string.
2) If it is "*false*" (the default), we copy all updates locally and flush them only on the "*executeBatch*" call. This way JDBC semantics are preserved.
3) If it is "*true*", all adds to the batch go to the streamer directly. This way it might be faster, but it violates JDBC; e.g. a call to "*clearBatch*" no longer works, and we should throw an exception.

The bottom line is that normal non-batched operations should never go through the streamer. The streamer is only involved when: a) the user explicitly declared that he performs a batch update, and b) the special flag in the connection string is set.

Vladimir.
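A minimal sketch of how the proposed flag could be read from the connection string (the "batch_streaming" parameter name comes from the proposal above; the URL layout and parser are my assumptions for illustration):

```java
// Hypothetical flag parser, not actual driver code.
class ConnectionFlags {
    /** Returns the value of batch_streaming from the URL's query part; false by default. */
    static boolean batchStreaming(String url) {
        int q = url.indexOf('?');
        if (q < 0)
            return false; // default: buffer locally, preserve JDBC semantics

        for (String pair : url.substring(q + 1).split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("batch_streaming"))
                return Boolean.parseBoolean(kv[1]);
        }
        return false;
    }
}
```

The driver would then branch in addBatch(): buffer locally when the flag is false, or push straight to the streamer (and throw from clearBatch()) when it is true.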
Alex,
In most cases JdbcQueryTask should be executed locally, on the client node started by the JDBC driver:

    JdbcQueryTask.QueryResult res = loc ? qryTask.call() :
        ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);

Is this behavior still valid after introducing the DML functionality?

In cases when the user wants to execute a query on a specific node, he should fully understand what he wants and what can go wrong.
Vladimir,
I see no reason to forbid streamer usage for non-batched statement execution. It is common that users already have their ETL tools, and you can't be sure whether they use batching or not.

Alex,

I guess we have to decide on streaming first, and then we will discuss batching separately, OK? This decision may become important for the batching implementation.

Sergi
Sergi,
If user call single *execute() *operation, than most likely it is not batching. We should not rely on strange case where user perform batching without using standard and well-adopted batching JDBC API. The main problem with streamer is that it is async and hence break happens-before guarantees in a single thread: SELECT after INSERT might not return inserted value. Honestly, I do not really understand why we are trying to re-invent a bicycle here. There is standard API - let's just use it and make flexible enough to take advantage of IgniteDataStreamer if needed. Is there any use case which is not covered with this solution? Or let me ask from the opposite side - are there any well-known JDBC drivers which perform batching/streaming from non-batched update statements? Vladimir. On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <[hidden email]> wrote: > Vladimir, > > I see no reason to forbid Streamer usage from non-batched statement > execution. > It is common that users already have their ETL tools and you can't be sure > if they use batching or not. > > Alex, > > I guess we have to decide on Streaming first and then we will discuss > Batching separately, ok? Because this decision may become important for > batching implementation. > > Sergi > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <[hidden email]>: > > > Alex, > > > > In most cases JdbcQueryTask should be executed locally on client node > > started by JDBC driver. > > > > JdbcQueryTask.QueryResult res = > > loc ? qryTask.call() : > > ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask); > > > > Is it valid behavior after introducing DML functionality? > > > > In cases when user wants to execute query on specific node he should > > fully understand what he wants and what can go in wrong way. > > > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko > > <[hidden email]> wrote: > > > Sergi, > > > > > > JDBC batching might work quite differently from driver to driver. 
Say, > > > MySQL happily rewrites queries as I had suggested in the beginning of > > > this thread (it's not the only strategy, but one of the possible > > > options) - and, BTW, would like to hear at least an opinion about it. > > > > > > On your first approach, section before streamer: you suggest that we > > > send single statement and multiple param sets as a single query task, > > > am I right? (Just to make sure that I got you properly.) If so, do you > > > also mean that API (namely JdbcQueryTask) between server and client > > > should also change? Or should new API means be added to facilitate > > > batching tasks? > > > > > > - Alex > > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <[hidden email]>: > > >> Guys, > > >> > > >> I discussed this feature with Dmitriy and we came to conclusion that > > >> batching in JDBC and Data Streaming in Ignite have different semantics > > and > > >> performance characteristics. Thus they are independent features (they > > may > > >> work together, may separately, but this is another story). > > >> > > >> Let me explain. > > >> > > >> This is how JDBC batching works: > > >> - Add N sets of parameters to a prepared statement. > > >> - Manually execute prepared statement. > > >> - Repeat until all the data is loaded. > > >> > > >> > > >> This is how data streamer works: > > >> - Keep adding data. > > >> - Streamer will buffer and load buffered per-node batches when they > are > > big > > >> enough. > > >> - Close streamer to make sure that everything is loaded. > > >> > > >> As you can see we have a difference in semantics of when we send data: > > if > > >> in our JDBC we will allow sending batches to nodes without calling > > >> `execute` (and probably we will need to make `execute` to no-op here), > > then > > >> we are violating semantics of JDBC, if we will disallow this behavior, > > then > > >> this batching will underperform. 
> > >> > > >> Thus I suggest keeping these features (JDBC Batching and JDBC Streaming) as > > >> separate features. > > >> > > >> As I already said they can work together: Batching will batch parameters > > >> and on `execute` they will go to the Streamer in one shot and Streamer will > > >> deal with the rest. > > >> > > >> Sergi > > >> > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <[hidden email]>: |
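For context, the rewriting strategy Alex proposes (and which MySQL's Connector/J applies under its `rewriteBatchedStatements` flag) can be sketched as pure string manipulation on the driver side; the class and method names below are illustrative, not actual Ignite driver code:

```java
// A sketch of collapsing N executions of a single-row INSERT template into
// one multi-row INSERT. The driver would also flatten the N parameter sets
// into a single argument array, in the same order as the repeated rows.
public class BatchRewrite {
    /** Repeats the VALUES row list of a single-row INSERT batchSize times. */
    public static String rewrite(String singleRowInsert, int batchSize) {
        // Locate the VALUES keyword: everything after it is the row template.
        int idx = singleRowInsert.toUpperCase().lastIndexOf("VALUES");
        String head = singleRowInsert.substring(0, idx + "VALUES".length());
        String row = singleRowInsert.substring(idx + "VALUES".length()).trim();

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < batchSize; i++)
            sb.append(i == 0 ? " " : ", ").append(row);

        return sb.toString();
    }
}
```

With this approach a batch of three executions of `INSERT INTO t VALUES (?, ?)` goes to the cluster as the single statement `INSERT INTO t VALUES (?, ?), (?, ?), (?, ?)`, so only one JdbcQueryTask is sent.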
If we use Streamer, then we always have `happens-before` broken. This is
ok, because the Streamer is for data loading, not for regular operations. We are not inventing any bicycles, just separating concerns: Batching and Streaming. My point here is that they should not depend on each other at all: Batching can work with or without Streaming, and Streaming can work with or without Batching. Your proposal is a set of non-obvious rules for how they interact. I see no reason for these complications. Sergi 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <[hidden email]>: |
Gents,
As Sergi suggested, batching and streaming are very different semantically. To use standard JDBC batching, all we need to do is convert it to a cache.putAll() call, as semantically a putAll(...) call is identical to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in between, then we may have to break the batch into several chunks and execute the update in between. The DataStreamer should not be used here. I believe that for streaming we need to add a special JDBC/ODBC connection flag. Whenever this flag is set to true, we should only allow INSERT or single-UPDATE operations and use the DataStreamer API internally. All operations other than INSERT or single-UPDATE should be prohibited. I think this design is semantically clear. Any objections? D. On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <[hidden email]> wrote: |
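Dmitriy's chunking rule above - consecutive INSERTs become one putAll-sized chunk, and any other DML statement breaks the batch so ordering is preserved - can be sketched roughly as follows. The statement classification and names are simplified illustrations, not Ignite code:

```java
import java.util.ArrayList;
import java.util.List;

// Walk a JDBC batch in order: group consecutive INSERTs into one chunk
// (a candidate for a single cache.putAll()), and emit any other statement
// (e.g. UPDATE ... WHERE) as its own single-element chunk, so the overall
// execution order of the batch is kept intact.
public class BatchChunker {
    public static List<List<String>> chunk(List<String> statements) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();

        for (String stmt : statements) {
            if (stmt.trim().toUpperCase().startsWith("INSERT")) {
                current.add(stmt);
            }
            else {
                // Flush the pending INSERT run before the interleaved update.
                if (!current.isEmpty()) {
                    chunks.add(current);
                    current = new ArrayList<>();
                }
                chunks.add(List.of(stmt));
            }
        }

        if (!current.isEmpty())
            chunks.add(current);

        return chunks;
    }
}
```

So a batch of [INSERT, INSERT, UPDATE, INSERT] yields three chunks: two INSERTs for one putAll, the UPDATE alone, then the trailing INSERT.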
I have already expressed my concern - this is a counterintuitive approach, because without happens-before a pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. And my question still stands - what products, except possibly Ignite, do this kind of JDBC streaming? Any example? Another problem is that a connection-wide property doesn't fit well in the JDBC pooling model. On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <[hidden email]> wrote:
-- Vladimir Ozerov Senior Software Architect GridGain Systems www.gridgain.com *+7 (960) 283 98 40* |
In reply to this post by dsetrakyan
I already expressed my concern - this is a counterintuitive approach, because without happens-before a pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. My question still stands - what products, except possibly Ignite, do this kind of JDBC streaming?

Another problem is that a connection-wide property doesn't fit well into the JDBC pooling model. Users will have to use different connections for streaming and non-streaming approaches.

Please see how Oracle did it; this is precisely what I am talking about: https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232 There are two batching modes - one with explicit flush and another with implicit flush, where Oracle decides on its own when it is best to communicate with the server. The batching mode can be declared globally or at the per-statement level. Simple and flexible.

On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <[hidden email]> wrote:
> As Sergi suggested, batching and streaming are very different semantically.
>
> To use standard JDBC batching, all we need to do is convert it to a
> cache.putAll() method, as semantically a putAll(...) call is identical to a
> JDBC batch.
> [...]
> I believe that for streaming we need to add a special JDBC/ODBC connection
> flag. Whenever this flag is set to true, then we should only allow INSERT
> or single-UPDATE operations and use the DataStreamer API internally.
> [...]

-- Vladimir Ozerov Senior Software Architect GridGain Systems www.gridgain.com *+7 (960) 283 98 40* |
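The explicit vs. implicit flush distinction referenced above can be sketched in plain Java. This is a toy model only - the `Batcher` class, its threshold, and the round-trip counter are invented for illustration and are not Oracle, JDBC, or Ignite APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the two batching modes: explicit flush
// (standard addBatch/executeBatch) vs. implicit flush (the driver
// sends the batch once it grows past a size threshold).
public class Batcher {
    private final List<Object[]> buffer = new ArrayList<>();
    private final int implicitFlushSize;   // 0 = explicit mode only
    private int roundTrips;                // simulated network hops

    public Batcher(int implicitFlushSize) {
        this.implicitFlushSize = implicitFlushSize;
    }

    // Analogue of PreparedStatement.addBatch(): buffer one parameter set.
    public void addBatch(Object... params) {
        buffer.add(params);
        if (implicitFlushSize > 0 && buffer.size() >= implicitFlushSize)
            flush();
    }

    // Analogue of executeBatch(): send everything buffered in one hop.
    public int flush() {
        if (buffer.isEmpty())
            return 0;
        int sent = buffer.size();
        buffer.clear();
        roundTrips++;              // one network interaction per flush
        return sent;
    }

    public int roundTrips() {
        return roundTrips;
    }

    public static void main(String[] args) {
        Batcher explicitMode = new Batcher(0);
        for (int i = 0; i < 10; i++)
            explicitMode.addBatch(i, "name-" + i);
        explicitMode.flush();
        // All 10 rows went out in a single simulated hop.
        System.out.println("explicit round trips: " + explicitMode.roundTrips());

        Batcher implicitMode = new Batcher(4);
        for (int i = 0; i < 10; i++)
            implicitMode.addBatch(i, "name-" + i);
        implicitMode.flush();      // final partial batch
        System.out.println("implicit round trips: " + implicitMode.roundTrips());
    }
}
```

With 10 rows, the explicit mode makes one hop on flush, while the implicit mode with a threshold of 4 makes three (two automatic, one final). The trade-off Vladimir describes is exactly this: who decides when the buffer crosses the wire.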
On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[hidden email]> wrote:
> I already expressed my concern - this is a counterintuitive approach,
> because without happens-before a pure streaming model can be applied only
> to independent chunks of data. [...] My question still stands - what
> products, except possibly Ignite, do this kind of JDBC streaming?

Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and DataStreamer.addData().

JDBC batching and putAll() are absolutely identical. If you see it as counterintuitive, I would ask for a concrete example.

As far as links between data go, Ignite does not have foreign-key constraints, so DataStreamer can insert data in any order (but again, not as part of a JDBC batch).

> Another problem is that a connection-wide property doesn't fit well into
> the JDBC pooling model. Users will have to use different connections for
> streaming and non-streaming approaches.

Using DataStreamer is not possible within the JDBC batching paradigm, period. I wish we could drop the high-level-feels-good discussions altogether, because it seems like we are spinning wheels here.

There is no way to use the streamer in a JDBC context unless we add a connection flag. Again, if you disagree, I would prefer to see a concrete example explaining why.

> Please see how Oracle did it; this is precisely what I am talking about:
> https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
> [...]
|
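The "JDBC batch is identical to putAll()" equivalence can be modeled in a few lines of plain Java. This is a hypothetical sketch - a `LinkedHashMap` stands in for `IgniteCache`, and `BatchAsPutAll` and its methods are invented for illustration, not real driver code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Model of "JDBC batch == cache.putAll()": batched INSERT parameter
// sets are collected client-side and applied synchronously in one
// shot, so reads issued afterwards in the same thread see the data.
public class BatchAsPutAll {
    private final Map<Integer, String> cache = new LinkedHashMap<>();   // stand-in for IgniteCache
    private final Map<Integer, String> pending = new LinkedHashMap<>(); // buffered parameter sets

    // Analogue of addBatch() on "INSERT INTO t VALUES (?, ?)".
    public void addBatch(int key, String val) {
        pending.put(key, val);
    }

    // Analogue of executeBatch(): one synchronous putAll, then clear.
    public int executeBatch() {
        int n = pending.size();
        cache.putAll(pending);   // the single "network hop" in this model
        pending.clear();
        return n;
    }

    // Analogue of "SELECT ... WHERE key = ?" issued after the batch.
    public String select(int key) {
        return cache.get(key);
    }

    public static void main(String[] args) {
        BatchAsPutAll b = new BatchAsPutAll();
        b.addBatch(1, "one");
        b.addBatch(2, "two");
        b.executeBatch();
        // Happens-before holds: the inserted value is immediately visible.
        System.out.println(b.select(1));
    }
}
```

The synchronous putAll is what makes the batch safe to follow with a SELECT in the same thread; an asynchronous streamer underneath would break exactly this guarantee, which is the crux of the disagreement in the thread.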
Dima,
I would like to point out that data streamer support had already been implemented in the course of the work on DML in 1.8, exactly as you are suggesting now (turned on via a connection flag; allowed only for MERGE - the data streamer can't do putIfAbsent stuff, right? - and with absolutely no relation to JDBC), *but* that patch was reverted - on advice from Vlad, which I believe had been agreed with you - so it didn't make it into 1.8 after all.

Also, while it's possible to maintain INSERT vs MERGE semantics using the streamer's allowOverwrite flag, I can't see how we could mimic UPDATE here: the streamer skips a put only in the case when the key is present AND allowOverwrite is false, while UPDATE should not put anything when the key is *missing* - i.e., there's no way to emulate the cache's *replace* operation semantics with the streamer (update the value only if the key is present, otherwise do nothing).

- Alex

On Dec 9, 2016, 10:00 PM, "Dmitriy Setrakyan" <[hidden email]> wrote:
> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[hidden email]> wrote:
> > [...]
>
> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and
> DataStreamer.addData(). JDBC batching and putAll() are absolutely
> identical.
> [...]
> > *IgniteDataStreamer* > > > > can > > > > > > be > > > > > > > used > > > > > > > >>> underneath. > > > > > > > >>> > > > > > > > >>> 2) Or we can have separate connection flag which will move > > all > > > > > > > >>> INSERT/UPDATE/DELETE statements through streamer. > > > > > > > >>> > > > > > > > >>> I prefer the first approach > > > > > > > >>> > > > > > > > >>> Also we need to keep in mind that data streamer has poor > > > > > performance > > > > > > > when > > > > > > > >>> adding single key-value pairs due to high overhead on > > > concurrency > > > > > and > > > > > > > other > > > > > > > >>> bookkeeping. Instead, it is better to pre-batch key-value > > pairs > > > > > > before > > > > > > > >>> giving them to streamer. > > > > > > > >>> > > > > > > > >>> Vladimir. > > > > > > > >>> > > > > > > > >>> [1] > > > > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/ > > > > > > > PreparedStatement.html# > > > > > > > >>> addBatch-- > > > > > > > >>> > > > > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko < > > > > > > > >>> [hidden email]> wrote: > > > > > > > >>> > > > > > > > >>> > Hello Igniters, > > > > > > > >>> > > > > > > > > >>> > One of the major improvements to DML has to be support of > > > batch > > > > > > > >>> > statements. I'd like to discuss its implementation. The > > > > suggested > > > > > > > >>> > approach is to rewrite given query turning it from few > > > INSERTs > > > > > into > > > > > > > >>> > single statement and processing arguments accordingly. I > > > > suggest > > > > > > this > > > > > > > >>> > as long as the whole point of batching is to make as > little > > > > > > > >>> > interactions with cluster as possible and to make > > operations > > > as > > > > > > > >>> > condensed as possible, and in case of Ignite it means > that > > we > > > > > > should > > > > > > > >>> > send as little JdbcQueryTasks as possible. 
And, as long > as > > a > > > > > query > > > > > > > >>> > task holds single query and its arguments, this approach > > will > > > > not > > > > > > > >>> > require any changes to be done to current design and > won't > > > > break > > > > > > any > > > > > > > >>> > backward compatibility - all dirty work on rewriting will > > be > > > > done > > > > > > by > > > > > > > >>> > JDBC driver. > > > > > > > >>> > Without rewriting, we could introduce some new query task > > for > > > > > batch > > > > > > > >>> > operations, but that would make impossible sending such > > > > requests > > > > > > from > > > > > > > >>> > newer clients to older servers (say, servers of version > > > 1.8.0, > > > > > > which > > > > > > > >>> > does not know about batching, let alone older versions). > > > > > > > >>> > I'd like to hear comments and suggestions from the > > community. > > > > > > Thanks! > > > > > > > >>> > > > > > > > > >>> > - Alex > > > > > > > >>> > > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > Vladimir Ozerov > > Senior Software Architect > > GridGain Systems > > www.gridgain.com > > *+7 (960) 283 98 40* > > > |
Sorry, "no relation w/JDBC" in my previous message should read "no relation
w/JDBC batching".

— Alex

On Dec 10, 2016, 1:52 PM, "Alexander Paschenko" <[hidden email]> wrote:

> Dima,
>
> I would like to point out that data streamer support had already been
> implemented in the course of work on DML in 1.8 exactly as you are
> suggesting now (turned on via a connection flag; only MERGE allowed — data
> streamer can't do putIfAbsent stuff, right?; absolutely no relation
> w/JDBC), *but* that patch was reverted — on advice from Vlad which I
> believe had been agreed with you — so it didn't make it into 1.8 after all.
>
> Also, while it's possible to maintain INSERT vs MERGE semantics using the
> streamer's allowOverwrite flag, I can't see how we could mimic UPDATE
> here: the only conditional behavior the streamer has is to skip a put when
> the key is already present AND allowOverwrite is false, while UPDATE must
> put nothing when the key is *missing* — i.e., there is no way to emulate
> the semantics of the cache's *replace* operation with the streamer (update
> the value only if the key is present, otherwise do nothing).
>
> — Alex
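Alex's objection (that neither setting of the streamer's allowOverwrite flag yields UPDATE semantics) can be sketched with a toy model. A plain HashMap stands in for the Ignite cache here, and all class and method names below are illustrative rather than Ignite API:

```java
import java.util.HashMap;
import java.util.Map;

public class StreamerSemantics {

    // allowOverwrite = true behaves like SQL MERGE: unconditional put.
    static void mergeLike(Map<String, Integer> cache, String key, int val) {
        cache.put(key, val);
    }

    // allowOverwrite = false behaves like SQL INSERT: put only if absent.
    static void insertLike(Map<String, Integer> cache, String key, int val) {
        cache.putIfAbsent(key, val);
    }

    // SQL UPDATE needs replace semantics: write only if the key exists.
    // Neither allowOverwrite setting gives this behavior.
    static void updateLike(Map<String, Integer> cache, String key, int val) {
        cache.replace(key, val);
    }

    public static void main(String[] args) {
        Map<String, Integer> cache = new HashMap<>();
        cache.put("a", 1);

        mergeLike(cache, "b", 2);    // new key inserted
        insertLike(cache, "a", 100); // no-op: key already present
        updateLike(cache, "c", 3);   // no-op: key missing
        updateLike(cache, "a", 10);  // existing key overwritten

        System.out.println(cache);   // prints {a=10, b=2}
    }
}
```

In cache terms the three statements map to put, putIfAbsent, and replace; the streamer's two allowOverwrite modes cover only the first two, which is exactly the gap Alex describes.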
Alex,

It seems to me that replace semantics can be implemented with a StreamReceiver, no?

D.

On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <[hidden email]> wrote:

> Sorry, "no relation w/JDBC" in my previous message should read "no relation
> w/JDBC batching".
>
> — Alex
And, as > long > >> as > >> > a > >> > > > > query > >> > > > > > > >>> > task holds single query and its arguments, this > approach > >> > will > >> > > > not > >> > > > > > > >>> > require any changes to be done to current design and > >> won't > >> > > > break > >> > > > > > any > >> > > > > > > >>> > backward compatibility - all dirty work on rewriting > >> will > >> > be > >> > > > done > >> > > > > > by > >> > > > > > > >>> > JDBC driver. > >> > > > > > > >>> > Without rewriting, we could introduce some new query > >> task > >> > for > >> > > > > batch > >> > > > > > > >>> > operations, but that would make impossible sending > such > >> > > > requests > >> > > > > > from > >> > > > > > > >>> > newer clients to older servers (say, servers of > version > >> > > 1.8.0, > >> > > > > > which > >> > > > > > > >>> > does not know about batching, let alone older > versions). > >> > > > > > > >>> > I'd like to hear comments and suggestions from the > >> > community. > >> > > > > > Thanks! > >> > > > > > > >>> > > >> > > > > > > >>> > - Alex > >> > > > > > > >>> > > >> > > > > > > >>> > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > >> > > >> > -- > >> > Vladimir Ozerov > >> > Senior Software Architect > >> > GridGain Systems > >> > www.gridgain.com > >> > *+7 (960) 283 98 40* > >> > > >> > > > |
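The query rewriting Alexander proposes at the top of the thread (and which he notes MySQL's driver already does for batches) can be sketched in a few lines: collapse N queued parameter sets for a single-row INSERT into one multi-row INSERT, so the driver sends one JdbcQueryTask instead of N. The class and method names below are hypothetical illustration, not Ignite code.

```java
/**
 * Hypothetical sketch of batch rewriting: turn N addBatch() parameter sets
 * for "INSERT INTO t VALUES (...)" into a single multi-row INSERT statement.
 */
public class BatchRewriter {
    /**
     * @param singleRowInsert e.g. "INSERT INTO person VALUES (?, ?)"
     * @param batchSize       number of parameter sets queued via addBatch()
     * @return rewritten SQL with one VALUES row group per parameter set
     */
    public static String rewrite(String singleRowInsert, int batchSize) {
        int valuesIdx = singleRowInsert.toUpperCase().indexOf("VALUES");
        if (valuesIdx < 0 || batchSize < 1)
            throw new IllegalArgumentException("Not a rewritable INSERT");

        // Split the statement into the part up to VALUES and the row group.
        String head = singleRowInsert.substring(0, valuesIdx + "VALUES".length());
        String rowGroup = singleRowInsert.substring(valuesIdx + "VALUES".length()).trim();

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < batchSize; i++)
            sb.append(i == 0 ? " " : ", ").append(rowGroup);

        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("INSERT INTO person VALUES (?, ?)", 3));
        // INSERT INTO person VALUES (?, ?), (?, ?), (?, ?)
    }
}
```

The flattened argument list would then be bound to the rewritten statement's placeholders in order, which is why this approach needs no protocol changes between client and server.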
OK folks, both data streamer support and batching support have been implemented.
Resulting design fully conforms to what Dima suggested initially: the two concepts are separated.

Streamed statements are turned on by a connection flag, and the stream auto-flush timeout can be tuned the same way. These statements support INSERT and MERGE without a subquery, as well as fast key-bounded DELETE and UPDATE. Each prepared statement in streamed mode has its own streamer object, and their lifecycles coincide: on close, the statement closes its streamer. Streaming mode is available only in the "local" mode of connection between the JDBC driver and the Ignite client (the default mode, when the JDBC driver creates an Ignite client node by itself) - there would be no sense in streaming if query arguments had to travel over the network.

Batched statements are used via the conventional JDBC API (setArgs... addBatch... executeBatch...). They also support INSERT and MERGE without a subquery, as well as fast key- (and, optionally, value-) bounded DELETE and UPDATE. These work in a similar manner to non-batched statements and likewise rely on the traditional putAll/invokeAll routines. Essentially, batching is just a way to pass a bigger map to cache.putAll without writing a single very long query. This works in the local as well as the "remote" Ignite JDBC connectivity mode.

More info (details are in the comments):

Batching - https://issues.apache.org/jira/browse/IGNITE-4269
Streaming - https://issues.apache.org/jira/browse/IGNITE-4169

Regards,
Alex

2016-12-10 23:39 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:
> Alex,
>
> It seems to me that replace semantics can be implemented with a StreamReceiver, no?
>
> D.
>
> On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <[hidden email]> wrote:
>> Sorry, "no relation w/JDBC" in my previous message should read "no relation w/JDBC batching".
>>
>> — Alex
>>
>> On Dec 10, 2016, at 1:52 PM, Alexander Paschenko <[hidden email]> wrote:
>>> Dima,
>>>
>>> I would like to point out that data streamer support had already been implemented in the course of the work on DML in 1.8 exactly as you are suggesting now (turned on via a connection flag; only MERGE allowed — the data streamer can't do putIfAbsent stuff, right?; absolutely no relation w/JDBC), *but* that patch was reverted — on advice from Vlad, which I believe had been agreed with you, so it didn't make it into 1.8 after all. Also, while it's possible to maintain the INSERT vs MERGE semantics using the streamer's allowOverwrite flag, I can't see how we could mimic UPDATE here: the streamer skips putting data to the cache only when the key is present AND allowOverwrite is false, while UPDATE should not put anything when the key is *missing* — i.e., there's no way to emulate the semantics of the cache's *replace* operation with the streamer (update the value only if the key is present, otherwise do nothing).
>>>
>>> — Alex
>>>
>>> On Dec 9, 2016, at 10:00 PM, Dmitriy Setrakyan <[hidden email]> wrote:
>>>> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[hidden email]> wrote:
>>>>> I already expressed my concern - this is a counterintuitive approach, because without happens-before the pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. My question still stands - what products, except possibly Ignite, do this kind of JDBC streaming?
>>>>
>>>> Vova, we have two mechanisms in the product: IgniteCache.putAll() and DataStreamer.addData().
>>>>
>>>> JDBC batching and putAll() are absolutely identical. If you see it as counterintuitive, I would ask for a concrete example.
>>>>
>>>> As far as links between data go, Ignite does not have foreign-key constraints, so the DataStreamer can insert data in any order (but again, not as part of a JDBC batch).
>>>>
>>>>> Another problem is that a connection-wide property doesn't fit well into the JDBC pooling model. Users will have to use different connections for streaming and non-streaming approaches.
>>>>
>>>> Using the DataStreamer is not possible within the JDBC batching paradigm, period. I wish we could drop the high-level-feels-good discussions altogether, because it seems like we are spinning wheels here.
>>>>
>>>> There is no way to use the streamer in a JDBC context unless we add a connection flag. Again, if you disagree, I would prefer to see a concrete example explaining why.
>>>>
>>>>> Please see how Oracle did that; this is precisely what I am talking about: https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
>>>>> Two batching modes - one with explicit flush, another with implicit flush, where Oracle decides on its own when it is better to communicate with the server. The batching mode can be declared globally or on a per-statement level. Simple and flexible.
>>>>>
>>>>> On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <[hidden email]> wrote:
>>>>>> Gents,
>>>>>>
>>>>>> As Sergi suggested, batching and streaming are very different semantically.
>>>>>>
>>>>>> To use standard JDBC batching, all we need to do is convert it to a cache.putAll() method, as semantically a putAll(...) call is identical to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in between, then we may have to break the batch into several chunks and execute the update in between. The DataStreamer should not be used here.
>>>>>>
>>>>>> I believe that for streaming we need to add a special JDBC/ODBC connection flag. Whenever this flag is set to true, we should only allow INSERT or single-UPDATE operations and use the DataStreamer API internally. All operations other than INSERT or single-UPDATE should be prohibited.
>>>>>>
>>>>>> I think this design is semantically clear. Any objections?
>>>>>>
>>>>>> D.
>>>>>>
>>>>>> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <[hidden email]> wrote:
>>>>>>> If we use the Streamer, then we always have `happens-before` broken. This is OK, because the Streamer is for data loading, not for usual operation.
>>>>>>>
>>>>>>> We are not inventing any bicycles, just separating concerns: Batching and Streaming.
>>>>>>>
>>>>>>> My point here is that they should not depend on each other at all: Batching can work with or without Streaming, just as Streaming can work with or without Batching.
>>>>>>>
>>>>>>> Your proposal is a set of non-obvious rules for them to work together. I see no reason for these complications.
>>>>>>>
>>>>>>> Sergi
>>>>>>>
>>>>>>> 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <[hidden email]>:
>>>>>>>> Sergi,
>>>>>>>>
>>>>>>>> If the user calls a single *execute()* operation, then most likely it is not batching. We should not rely on the strange case where a user performs batching without using the standard and well-adopted JDBC batching API. The main problem with the streamer is that it is async and hence breaks happens-before guarantees in a single thread: a SELECT after an INSERT might not return the inserted value.
>>>>>>>>
>>>>>>>> Honestly, I do not really understand why we are trying to re-invent a bicycle here. There is a standard API - let's just use it and make it flexible enough to take advantage of IgniteDataStreamer if needed.
>>>>>>>>
>>>>>>>> Is there any use case which is not covered by this solution? Or let me ask from the opposite side - are there any well-known JDBC drivers which perform batching/streaming from non-batched update statements?
>>>>>>>>
>>>>>>>> Vladimir.
>>>>>>>>
>>>>>>>> On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <[hidden email]> wrote:
>>>>>>>>> Vladimir,
>>>>>>>>>
>>>>>>>>> I see no reason to forbid Streamer usage from non-batched statement execution. It is common that users already have their ETL tools, and you can't be sure whether they use batching or not.
>>>>>>>>>
>>>>>>>>> Alex,
>>>>>>>>>
>>>>>>>>> I guess we have to decide on Streaming first and then discuss Batching separately, OK? Because this decision may become important for the batching implementation.
>>>>>>>>>
>>>>>>>>> Sergi
|
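Alexander's point that "batching is just a way to pass a bigger map to cache.putAll" can be simulated with plain JDK types; a plain Map stands in for IgniteCache here, so none of Ignite's actual API is assumed, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of JDBC batching mapped onto a single putAll(): each addBatch()
 * buffers one key/value pair extracted from the INSERT arguments, and
 * executeBatch() hands the whole buffer over in one bulk operation.
 * A plain Map stands in for IgniteCache; this illustrates the semantics
 * discussed in the thread, not Ignite's actual implementation.
 */
public class PutAllBatch<K, V> {
    private final Map<K, V> buffer = new LinkedHashMap<>(); // preserves insertion order
    private final Map<K, V> cache;                          // stand-in for IgniteCache

    public PutAllBatch(Map<K, V> cache) {
        this.cache = cache;
    }

    /** Analogous to setArgs(...) followed by addBatch(). */
    public void addBatch(K key, V val) {
        buffer.put(key, val);
    }

    /** Analogous to executeBatch(): one bulk operation instead of N single puts. */
    public int executeBatch() {
        int size = buffer.size();
        cache.putAll(buffer); // the single "network hop" of the batch
        buffer.clear();
        return size;
    }
}
```

Nothing reaches the cache until executeBatch() is called, which is exactly the happens-before property that distinguishes this from the asynchronous streamer path.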
Alexander,
A couple of comments in regard to the streaming mode. I would rename the existing property to "ignite.jdbc.streaming" and add additional ones that will help to manage and tune the streaming behavior:

ignite.jdbc.streaming.perNodeBufferSize
ignite.jdbc.streaming.perNodeParallelOperations
ignite.jdbc.streaming.autoFlushFrequency

Any other thoughts?

— Denis
|
>>>>>>>>>>>>> >>>>>>>>>>>>> As I already said they can work together: Batching will >>>>> batch >>>>>>>>>> parameters >>>>>>>>>>>>> and on `execute` they will go to the Streamer in one shot >>>>> and >>>>>>>>> Streamer >>>>>>>>>>> will >>>>>>>>>>>>> deal with the rest. >>>>>>>>>>>>> >>>>>>>>>>>>> Sergi >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov < >>>>>>> [hidden email] >>>>>>>>> : >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Alex, >>>>>>>>>>>>>> >>>>>>>>>>>>>> To my understanding there are two possible approaches to >>>>>>> batching >>>>>>>>> in >>>>>>>>>>> JDBC >>>>>>>>>>>>>> layer: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1) Rely on default batching API. Specifically >>>>>>>>>>>>>> *PreparedStatement.addBatch()* [1] >>>>>>>>>>>>>> and others. This is nice and clear API, users are used >>> to >>>>> it, >>>>>>> and >>>>>>>>>> it's >>>>>>>>>>>>>> adoption will minimize user code changes when migrating >>>>> from >>>>>>>> other >>>>>>>>>> JDBC >>>>>>>>>>>>>> sources. We simply copy updates locally and then execute >>>>> them >>>>>>> all >>>>>>>>> at >>>>>>>>>>> once >>>>>>>>>>>>>> with only a single network hop to servers. >>>>>> *IgniteDataStreamer* >>>>>>>> can >>>>>>>>>> be >>>>>>>>>>> used >>>>>>>>>>>>>> underneath. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2) Or we can have separate connection flag which will >>> move >>>>>> all >>>>>>>>>>>>>> INSERT/UPDATE/DELETE statements through streamer. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I prefer the first approach >>>>>>>>>>>>>> >>>>>>>>>>>>>> Also we need to keep in mind that data streamer has poor >>>>>>>>> performance >>>>>>>>>>> when >>>>>>>>>>>>>> adding single key-value pairs due to high overhead on >>>>>>> concurrency >>>>>>>>> and >>>>>>>>>>> other >>>>>>>>>>>>>> bookkeeping. Instead, it is better to pre-batch >>> key-value >>>>>> pairs >>>>>>>>>> before >>>>>>>>>>>>>> giving them to streamer. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Vladimir. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> https://docs.oracle.com/javase/8/docs/api/java/sql/ >>>>>>>>>>> PreparedStatement.html# >>>>>>>>>>>>>> addBatch-- >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko < >>>>>>>>>>>>>> [hidden email]> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hello Igniters, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> One of the major improvements to DML has to be support >>>>> of >>>>>>> batch >>>>>>>>>>>>>>> statements. I'd like to discuss its implementation. >>> The >>>>>>>> suggested >>>>>>>>>>>>>>> approach is to rewrite given query turning it from few >>>>>>> INSERTs >>>>>>>>> into >>>>>>>>>>>>>>> single statement and processing arguments >>> accordingly. I >>>>>>>> suggest >>>>>>>>>> this >>>>>>>>>>>>>>> as long as the whole point of batching is to make as >>>>> little >>>>>>>>>>>>>>> interactions with cluster as possible and to make >>>>>> operations >>>>>>> as >>>>>>>>>>>>>>> condensed as possible, and in case of Ignite it means >>>>> that >>>>>> we >>>>>>>>>> should >>>>>>>>>>>>>>> send as little JdbcQueryTasks as possible. And, as >>> long >>>>> as >>>>>> a >>>>>>>>> query >>>>>>>>>>>>>>> task holds single query and its arguments, this >>> approach >>>>>> will >>>>>>>> not >>>>>>>>>>>>>>> require any changes to be done to current design and >>>>> won't >>>>>>>> break >>>>>>>>>> any >>>>>>>>>>>>>>> backward compatibility - all dirty work on rewriting >>>>> will >>>>>> be >>>>>>>> done >>>>>>>>>> by >>>>>>>>>>>>>>> JDBC driver. >>>>>>>>>>>>>>> Without rewriting, we could introduce some new query >>>>> task >>>>>> for >>>>>>>>> batch >>>>>>>>>>>>>>> operations, but that would make impossible sending >>> such >>>>>>>> requests >>>>>>>>>> from >>>>>>>>>>>>>>> newer clients to older servers (say, servers of >>> version >>>>>>> 1.8.0, >>>>>>>>>> which >>>>>>>>>>>>>>> does not know about batching, let alone older >>> versions). >>>>>>>>>>>>>>> I'd like to hear comments and suggestions from the >>>>>> community. 
>>>>>>>>>> Thanks! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Alex >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Vladimir Ozerov >>>>>> Senior Software Architect >>>>>> GridGain Systems >>>>>> www.gridgain.com >>>>>> *+7 (960) 283 98 40* >>>>>> >>>>> >>>> >>> |
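As Alexander notes above, one possible batching strategy is the one MySQL's driver uses with `rewriteBatchedStatements`: fold N batched parameter sets for a single-row INSERT into one multi-row INSERT, so that only one statement (and hence, in Ignite's case, one JdbcQueryTask) travels to the cluster. A minimal standalone sketch of that rewriting step; the `BatchRewrite` class and its method are illustrative only, not part of any Ignite or JDBC API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of the query-rewriting idea discussed in this thread: N batched
 * parameter sets for "INSERT INTO t (...) VALUES (?, ...)" are folded into
 * a single multi-row INSERT plus one flattened argument list.
 */
public class BatchRewrite {
    /**
     * @param sqlPrefix       statement text up to (not including) VALUES.
     * @param rowPlaceholders placeholder group for one row, e.g. "(?, ?)".
     * @param batchedArgs     one Object[] of arguments per batched row.
     * @return { rewritten SQL string, flattened List of arguments }.
     */
    public static Object[] rewrite(String sqlPrefix, String rowPlaceholders,
                                   List<Object[]> batchedArgs) {
        StringBuilder sql = new StringBuilder(sqlPrefix);
        List<Object> flatArgs = new ArrayList<>();

        for (int i = 0; i < batchedArgs.size(); i++) {
            // First row gets " VALUES ", subsequent rows are comma-separated.
            sql.append(i == 0 ? " VALUES " : ", ").append(rowPlaceholders);
            flatArgs.addAll(Arrays.asList(batchedArgs.get(i)));
        }

        return new Object[] { sql.toString(), flatArgs };
    }

    public static void main(String[] args) {
        List<Object[]> batch = new ArrayList<>();
        batch.add(new Object[] { 1, "Ann" });
        batch.add(new Object[] { 2, "Bob" });

        Object[] res = rewrite("INSERT INTO person (id, name)", "(?, ?)", batch);

        // One statement, one argument array - hence a single query task.
        System.out.println(res[0]); // INSERT INTO person (id, name) VALUES (?, ?), (?, ?)
        System.out.println(res[1]); // [1, Ann, 2, Bob]
    }
}
```

This captures why the approach needs no protocol changes: the driver still sends a single query string with a single argument list, exactly what the existing JdbcQueryTask already carries.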
Auto flush freq is already there, I just forgot to mention it in the comments. Will add the rest today.

— Alex

On Dec 19, 2016 at 10:29 PM, "Denis Magda" <[hidden email]> wrote:

Alexander,

A couple of comments in regards to the streaming mode.

I would rename the existing property to "ignite.jdbc.streaming" and add additional ones that will help to manage and tune the streaming behavior:
ignite.jdbc.streaming.perNodeBufferSize
ignite.jdbc.streaming.perNodeParallelOperations
ignite.jdbc.streaming.autoFlushFrequency

Any other thoughts?

— Denis

On Dec 19, 2016, at 8:02 AM, Alexander Paschenko <[hidden email]> wrote:

OK folks, both data streamer support and batching support have been implemented.

The resulting design fully conforms to what Dima suggested initially - these two concepts are separated.

Streamed statements are turned on by a connection flag, and the stream auto flush timeout can be tuned in the same way. These statements support INSERT and MERGE w/o subquery as well as fast key-bounded DELETE and UPDATE. Each prepared statement in streamed mode has its own streamer object and their lifecycles are the same - on close, the statement closes its streamer. Streaming mode is available only in the "local" mode of connection between the JDBC driver and the Ignite client (the default mode, when the JDBC driver creates the Ignite client node by itself) - there would be no sense in streaming if query args had to travel over the network.

Batched statements are used via the conventional JDBC API (setArgs... addBatch... executeBatch...); they also support INSERT and MERGE w/o subquery as well as fast key (and, optionally, value) bounded DELETE and UPDATE. These work in a similar manner to non-batched statements and likewise rely on traditional putAll/invokeAll routines. Essentially, batching is just a way to pass a bigger map to cache.putAll without writing a single very long query. This works in local as well as "remote" Ignite JDBC connectivity mode.

More info (details are in the comments):

Batching - https://issues.apache.org/jira/browse/IGNITE-4269
Streaming - https://issues.apache.org/jira/browse/IGNITE-4169

Regards,
Alex

2016-12-10 23:39 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:

Alex,

It seems to me that replace semantics can be implemented with a StreamReceiver, no?

D.

On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <[hidden email]> wrote:

Sorry, "no relation w/JDBC" in my previous message should read "no relation w/JDBC batching".

— Alex

On Dec 10, 2016 at 1:52 PM, "Alexander Paschenko" <[hidden email]> wrote:

Dima,

I would like to point out that data streamer support had already been implemented in the course of the work on DML in 1.8 exactly as you are suggesting now (turned on via connection flag; allowed only MERGE — the data streamer can't do putIfAbsent stuff, right?; absolutely no relation w/JDBC), *but* that patch had been reverted — by advice from Vlad, which I believe had been agreed with you, so it didn't make it to 1.8 after all. Also, while it's possible to maintain INSERT vs MERGE semantics using the streamer's allowOverwrite flag, I can't see how we could mimic UPDATE here: the streamer skips the put only in the case when the key is present AND allowOverwrite is false, while UPDATE should not put anything when the key is *missing* — i.e., there's no way to emulate the cache's *replace* operation semantics with the streamer (update the value only if the key is present, otherwise do nothing).

— Alex

On Dec 9, 2016 at 10:00 PM, "Dmitriy Setrakyan" <[hidden email]> wrote:

On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[hidden email]> wrote:

> I already expressed my concern - this is a counterintuitive approach, because without happens-before the pure streaming model can be applied only to independent chunks of data. It means that the mentioned ETL use case is not feasible - ETL always depends on implicit or explicit links between tables, and hence streaming is not applicable here. My question still stands - what products, except possibly Ignite, do this kind of JDBC streaming?

Vova, we have 2 mechanisms in the product: IgniteCache.putAll() or DataStreamer.addData().

JDBC batching and putAll() are absolutely identical. If you see it as counterintuitive, I would ask for a concrete example.

As far as links between data: Ignite does not have foreign-key constraints, so the DataStreamer can insert data in any order (but again, not as part of a JDBC batch).

> Another problem is that a connection-wide property doesn't fit well into the JDBC pooling model. Users will have to use different connections for streaming and non-streaming approaches.

Using the DataStreamer is not possible within the JDBC batching paradigm, period. I wish we could drop the high-level-feels-good discussions altogether, because it seems like we are spinning wheels here.

There is no way to use the streamer in a JDBC context unless we add a connection flag. Again, if you disagree, I would prefer to see a concrete example explaining why.

> Please see how Oracle did that, this is precisely what I am talking about: https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
> Two batching modes - one with explicit flush, another one with implicit flush, when Oracle decides on its own when it is better to communicate with the server. The batching mode can be declared globally or on a per-statement level. Simple and flexible.
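Alexander's point about UPDATE semantics can be stated concretely without any Ignite dependency: the streamer with allowOverwrite=false behaves like putIfAbsent (INSERT), with allowOverwrite=true like an unconditional put (MERGE), but UPDATE needs replace — "write only if the key already exists" — which neither streamer mode provides on its own (Dmitriy's suggestion is that a StreamReceiver calling cache.replace() could layer it on). A minimal sketch of the three semantics using a plain ConcurrentMap as a stand-in for the cache:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Stand-in illustration (plain ConcurrentMap, not Ignite) of the cache
 * semantics discussed in the thread: INSERT ~ putIfAbsent,
 * MERGE ~ put, UPDATE ~ replace.
 */
public class UpdateSemantics {
    public static ConcurrentMap<Integer, String> demo() {
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>();
        cache.put(1, "old");

        // INSERT semantics: write only if absent (streamer, allowOverwrite=false).
        cache.putIfAbsent(1, "insert"); // key 1 exists -> no-op
        cache.putIfAbsent(2, "insert"); // key 2 absent -> written

        // MERGE semantics: unconditional write (streamer, allowOverwrite=true).
        cache.put(3, "merge");

        // UPDATE semantics: write only if present - the replace() operation
        // that the streamer cannot express with allowOverwrite alone.
        cache.replace(1, "update");     // key 1 exists -> updated
        cache.replace(4, "update");     // key 4 absent -> no-op

        return cache;
    }

    public static void main(String[] args) {
        ConcurrentMap<Integer, String> cache = demo();
        System.out.println(cache.get(1));        // update
        System.out.println(cache.get(2));        // insert
        System.out.println(cache.get(3));        // merge
        System.out.println(cache.containsKey(4)); // false
    }
}
```

The asymmetry in the last two calls is exactly the gap Alexander describes: allowOverwrite can toggle between the first two behaviors, but nothing in the streamer's flags yields the "present-only" write.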