Apache Ignite Developers - Legacy Mail Archive

IgniteCache.loadCache improvement proposal

yzhdanov

Re: IgniteCache.loadCache improvement proposal

> Yakov, I agree that such scenario should be avoided. I also think that
> loadCache(...) method, as it is right now, provides a way to avoid it.

No, it does not.

--Yakov

dsetrakyan

Re: IgniteCache.loadCache improvement proposal

On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:

> > Yakov, I agree that such scenario should be avoided. I also think that
> > loadCache(...) method, as it is right now, provides a way to avoid it.
>
> No, it does not.
>

Yes it does :)

yzhdanov

Re: IgniteCache.loadCache improvement proposal

> On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:
>
> > > Yakov, I agree that such scenario should be avoided. I also think that
> > > loadCache(...) method, as it is right now, provides a way to avoid it.
> >
> > No, it does not.
>
> Yes it does :)

No it doesn't. Load cache should either send a query to the DB that filters
all the data on the server side, which in turn may result in a full scan of
a 2 TB data set dozens of times (once per node), or send a query that brings
the whole data set to each node, which is unacceptable as well.

--Yakov

dsetrakyan

Re: IgniteCache.loadCache improvement proposal

On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov <[hidden email]> wrote:

> > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:
> >
> > > > Yakov, I agree that such scenario should be avoided. I also think that
> > > > loadCache(...) method, as it is right now, provides a way to avoid it.
> > >
> > > No, it does not.
> >
> > Yes it does :)
>
> No it doesn't. Load cache should either send a query to the DB that filters
> all the data on the server side, which in turn may result in a full scan of
> a 2 TB data set dozens of times (once per node), or send a query that brings
> the whole data set to each node, which is unacceptable as well.
>

Why not store the partition ID in the database and query only local
partitions? Whatever approach we design with a DataStreamer will be slower
than this.
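
A minimal sketch of what "query only local partitions" could look like,
assuming the table is called PERSON and carries a PART column holding the
partition ID. The class name, the method names and the modulo mapping are
illustrative assumptions; Ignite's real affinity function is different.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: PERSON/PART and the modulo mapping are assumptions. */
final class PartitionQueries {
    /** Hypothetical key-to-partition mapping; the value written to PART. */
    static int partitionOf(long key, int partitions) {
        return (int) Math.floorMod(key, (long) partitions);
    }

    /** Builds one query per partition owned by the local node. */
    static List<String> queriesFor(List<Integer> localPartitions) {
        List<String> sql = new ArrayList<>();
        for (int part : localPartitions) {
            sql.add("SELECT * FROM PERSON WHERE PART = " + part);
        }
        return sql;
    }
}
```

Each node would run only the queries for the partitions it owns, so no node
has to read rows that belong elsewhere.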

Valentin Kulichenko

Re: IgniteCache.loadCache improvement proposal

Alexandr,

The 'local' prefix in Ignite APIs means that the method is invoked only on
the current node, while its regular sibling is invoked in a distributed
fashion. localLoadCache doesn't imply that only local partitions are loaded.
It happens to work this way right now, but that doesn't mean it can't be
changed (and I'm not suggesting changing the default behavior, BTW).

Method overhead is decreased with my approach, if used properly. You can
call localLoadCache with a data-streamer-based closure; the database will
then be queried only from the local node, and the local node will distribute
the data across the other nodes. All I did was abstract the logic of moving
an entry from the store to the cache, because currently the user has no
option to override it.
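
The idea can be sketched without any Ignite API. In the simulation below,
plain maps stand in for cluster nodes, a modulo stands in for the affinity
function, and the closure plays the role a data streamer would play. All
names are hypothetical and only illustrate the flow: one node reads the
store once and the closure decides where each entry goes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Simulation only: plain maps stand in for nodes; no Ignite API is used. */
final class StreamingLoadSketch {
    /** One map per simulated node, keyed by node index. */
    final Map<Integer, Map<Long, String>> nodes = new HashMap<>();
    final int nodeCount;

    StreamingLoadSketch(int nodeCount) {
        this.nodeCount = nodeCount;
        for (int i = 0; i < nodeCount; i++) nodes.put(i, new HashMap<>());
    }

    /** Hypothetical owner lookup (modulo), standing in for affinity. */
    int ownerOf(long key) {
        return (int) Math.floorMod(key, (long) nodeCount);
    }

    /** Streamer-like closure: routes every entry to the node that owns it. */
    BiConsumer<Long, String> routingClosure() {
        return (key, value) -> nodes.get(ownerOf(key)).put(key, value);
    }

    /** The loading node reads the store once and hands each row to the closure. */
    static void load(Map<Long, String> store, BiConsumer<Long, String> closure) {
        store.forEach(closure);
    }
}
```

The store is scanned a single time regardless of the number of nodes; the
cost moves from the database to the network between nodes.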

If you still believe this doesn't work, can you please elaborate on what
exactly you propose? What code should we add and/or change in Ignite, and
how will the user use it, API-wise?

-Val

On Wed, Nov 16, 2016 at 5:40 AM, Dmitriy Setrakyan <[hidden email]>
wrote:


Alexandr Kuramshin

Re: IgniteCache.loadCache improvement proposal

In reply to this post by dsetrakyan

Dmitriy,

I am not fully confident that a partition ID is the best approach in all
cases. Even if we have full access to the database structure, there are
other problems.

Assume we have a table PERSON (ID NUMBER, NAME VARCHAR, SURNAME VARCHAR,
AGE NUMBER, EMPL_DATE DATE). And we add our own column PART NUMBER.

While we already have indexes IDX1(NAME), IDX2(SURNAME), IDX3(AGE),
IDX4(EMPL_DATE), we have to add a new two-column index IDX5(PART, EMPL_DATE)
to preload at startup, for example, recently employed persons.

And if we'd like to query filtered data from the database, we'd also have
to create the other compound indexes IDX6(PART, NAME), IDX7(PART, SURNAME),
IDX8(PART, AGE). So we double the overhead imposed by the indexes.

After these modifications to the database are done and the PART column is
filled, what should we do to preload the data?

We should perform as many database queries as there are partitions stored
on the nodes. The number of queries would be 1024 with the default settings
of the affinity function. Some calls may not return any data at all, which
would be a wasted network round-trip. It may also be a problem for some
databases to run that many parallel queries efficiently without degrading
total throughput.

The DataStreamer approach may be faster, but it should be tested.
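
To make the wasted round-trips concrete, here is a rough sketch that counts
how many per-partition queries would come back empty for a given set of
keys. The modulo mapping is an assumption used for illustration and is not
Ignite's affinity function; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch only: the modulo mapping is an assumption, not Ignite's affinity. */
final class PartitionQueryCost {
    /** Number of per-partition queries that would return no rows. */
    static int emptyQueries(long[] keys, int partitions) {
        Set<Integer> nonEmpty = new HashSet<>();
        for (long key : keys) {
            nonEmpty.add((int) Math.floorMod(key, (long) partitions));
        }
        // One query is issued per partition, whether or not it holds data.
        return partitions - nonEmpty.size();
    }
}
```

With a small table and 1024 partitions, most of the queries return nothing.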

2016-11-16 16:40 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:

--
Thanks,
Alexandr Kuramshin

Valentin Kulichenko

Re: IgniteCache.loadCache improvement proposal

Alexandr,

This has been tested many times already by our users and the answer is
simple - it depends :) Any approach has its pros and cons, and you never
know which one will be better for a particular use case, database, data
model, hardware, etc.

Having said that, you will never find the best way to load the data,
because it just doesn't exist. What I propose is just to make the API more
generic and give the user even more control than they have now.

-Val

On Fri, Nov 18, 2016 at 6:53 AM, Alexandr Kuramshin <[hidden email]>
wrote:


yzhdanov

Re: IgniteCache.loadCache improvement proposal

In reply to this post by dsetrakyan

> Why not store the partition ID in the database and query only local
> partitions? Whatever approach we design with a DataStreamer will be slower
> than this.

Because this can be some generic DB. Imagine the app migrating to IMDG.

I am pretty sure that in many cases the data streamer approach will be
faster, and in many cases the approach with multiple queries will be faster.
The choice should depend on many factors. I like Val's suggestions. I think
he is going in the right direction.

--Yakov

Valentin Kulichenko

Re: IgniteCache.loadCache improvement proposal

Guys,

I created a ticket for this:
https://issues.apache.org/jira/browse/IGNITE-4255

Feel free to provide comments.

-Val

On Sat, Nov 19, 2016 at 6:56 AM, Yakov Zhdanov <[hidden email]> wrote:


Alexandr Kuramshin

Re: IgniteCache.loadCache improvement proposal

Val, Yakov,

Sorry for the delay, I needed time to think and to run some tests.

Anyway, extending the API and supplying a default implementation is good.
It makes the framework more flexible and usable.

But your proposed extension will not solve the problem that I have raised.
Please read the following with special attention.

The current implementation of IgniteCache.loadCache causes parallel
execution of IgniteCache.localLoadCache on each node in the cluster. It's a
bad implementation, but it has the *right semantics*.

You propose to extend IgniteCache.localLoadCache and use it to load data on
all the nodes. That is bad semantics. But it also leads to a bad
implementation. Please note why.

When you filter the data with the supplied IgniteBiPredicate, you may
access data that must be co-located. Hence, to load the data to all the
nodes, you need access to all the related data partitioned across the
cluster. This leads to great network overhead and near-cache overload.

And that is why I am wondering why IgniteBiPredicate is executed for every
key supplied by Cache.loadCache, and not only for those keys that will be
stored on this node.

My opinion, in conclusion.

localLoadCache should first filter a key by the affinity function and the
current cache topology, *then* invoke the predicate, and then store the
entity in the cache (possibly by invoking the supplied closure). All
associated partitions should be locked for the duration of the load.

IgniteCache.loadCache should perform Cache.loadCache on one (or a few)
nodes, then transfer the entities to the remote nodes, *then* invoke the
predicate and closure on the remote nodes.
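
The ordering proposed for localLoadCache can be sketched as follows. This is
not Ignite code: the modulo owner lookup stands in for the affinity function
plus topology, a plain map stands in for the cache, and every name is
hypothetical. Partition locking is left out of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

/** Sketch only: illustrates the proposed ordering; it is not Ignite code. */
final class AffinityFirstLoad {
    /** Counts how often the user predicate ran. */
    int predicateCalls = 0;

    /** Hypothetical owner lookup (modulo), standing in for affinity. */
    static int ownerOf(long key, int nodeCount) {
        return (int) Math.floorMod(key, (long) nodeCount);
    }

    /** Keeps only entries owned by localNode that also pass the filter. */
    Map<Long, String> localLoad(Map<Long, String> store, int localNode,
                                int nodeCount, BiPredicate<Long, String> filter) {
        Map<Long, String> cache = new HashMap<>();
        for (Map.Entry<Long, String> e : store.entrySet()) {
            // Step 1: affinity check, skips keys this node will never store.
            if (ownerOf(e.getKey(), nodeCount) != localNode) continue;
            // Step 2: user predicate, now invoked only for local keys.
            predicateCalls++;
            if (!filter.test(e.getKey(), e.getValue())) continue;
            // Step 3: store the entry.
            cache.put(e.getKey(), e.getValue());
        }
        return cache;
    }
}
```

With four nodes and eight keys, the predicate runs twice on each node
instead of eight times, and it only ever sees keys that are local.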

2016-11-22 2:16 GMT+03:00 Valentin Kulichenko <[hidden email]>:


--
Thanks,
Alexandr Kuramshin