> Yakov, I agree that such scenario should be avoided. I also think that
> loadCache(...) method, as it is right now, provides a way to avoid it.

No, it does not.

--Yakov
On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:
> > Yakov, I agree that such scenario should be avoided. I also think that
> > loadCache(...) method, as it is right now, provides a way to avoid it.
>
> No, it does not.

Yes it does :)
> On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:
>
> > > Yakov, I agree that such scenario should be avoided. I also think that
> > > loadCache(...) method, as it is right now, provides a way to avoid it.
> >
> > No, it does not.
>
> Yes it does :)

No, it doesn't. Load cache should either send a query to the DB that filters all the data on the server side, which in turn may result in a full scan of the 2 TB data set dozens of times (once per node), or send a query that brings the whole data set to each node, which is unacceptable as well.

--Yakov
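A minimal sketch of the cost Yakov describes, assuming a toy ownership rule (`key mod nodeCount`) in place of Ignite's affinity function. The class and method names are hypothetical and no Ignite API is used; the point is only that when every node runs the same load, the table is read once per node while each row is kept exactly once.

```java
/** Toy model: every node runs the same load and keeps only the keys it owns. */
final class LoadCacheCostSketch {
    /** Hypothetical ownership rule (an assumption, not Ignite's affinity function). */
    static int ownerOf(int key, int nodeCount) {
        return Math.floorMod(key, nodeCount);
    }

    /**
     * Simulates all nodes reading the full table.
     * Returns {rows read across all nodes, rows kept across all nodes}.
     */
    static long[] everyNodeReadsEverything(int totalRows, int nodeCount) {
        long read = 0;
        long kept = 0;
        for (int node = 0; node < nodeCount; node++) {
            for (int key = 0; key < totalRows; key++) {
                // Scanned by the DB (server-side filter) or shipped to the node (client-side filter).
                read++;
                if (ownerOf(key, nodeCount) == node)
                    kept++; // only the owning node stores the entry
            }
        }
        return new long[] {read, kept};
    }

    public static void main(String[] args) {
        long[] r = everyNodeReadsEverything(1_000, 24);
        System.out.println("rows read: " + r[0] + ", rows kept: " + r[1]);
    }
}
```

With 1,000 rows and 24 nodes the model reads 24,000 rows to keep 1,000, which is the "dozens of full scans" effect scaled down.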
On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov <[hidden email]> wrote:
> > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:
> >
> > > > Yakov, I agree that such scenario should be avoided. I also think that
> > > > loadCache(...) method, as it is right now, provides a way to avoid it.
> > >
> > > No, it does not.
> >
> > Yes it does :)
>
> No it doesn't. Load cache should either send a query to DB that filters all
> the data on server side which, in turn, may result to full-scan of 2 Tb
> data set dozens of times (equal to node count) or send a query that brings
> the whole dataset to each node which is unacceptable as well.

Why not store the partition ID in the database and query only local partitions? Whatever approach we design with a DataStreamer will be slower than this.
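The partition-ID idea can be sketched roughly as follows. Everything here is an assumption made for illustration: the `PERSON` table, the `PART` column, and the `hashCode mod 1024` mapping are hypothetical, and Ignite's real affinity function computes partitions differently. The sketch only shows the shape of a query restricted to the partitions a node owns.

```java
import java.util.List;
import java.util.StringJoiner;

/** Builds a load query limited to the partitions owned by the local node. */
final class PartitionQuerySketch {
    /** Assumed partition count for the sketch. */
    static final int PARTITIONS = 1024;

    /** Hypothetical key-to-partition mapping (not Ignite's affinity function). */
    static int partitionOf(Object key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    /** Returns a SELECT filtered by the hypothetical PART column. */
    static String localLoadQuery(List<Integer> localPartitions) {
        if (localPartitions.isEmpty())
            throw new IllegalArgumentException("node owns no partitions, nothing to load");
        StringJoiner in = new StringJoiner(", ", "(", ")");
        for (int p : localPartitions)
            in.add(Integer.toString(p));
        return "SELECT ID, NAME FROM PERSON WHERE PART IN " + in;
    }

    public static void main(String[] args) {
        System.out.println(localLoadQuery(List.of(3, 7)));
    }
}
```

The partition value would have to be written to the `PART` column whenever a row is stored, which is the database-side cost discussed later in the thread.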
Alexandr,
The 'local' prefix in Ignite APIs means that the method is invoked only on the current node, while its regular sibling is invoked in a distributed fashion. localLoadCache doesn't imply that only local partitions are loaded. It turns out to work this way right now, but that doesn't mean it can't be changed (and I don't suggest changing the default behavior, BTW).

Method overhead is decreased with my approach, if used properly. You can call localLoadCache with a data streamer based closure, and the database will be queried only from the local node, and the local node will then distribute the data across the other nodes. All I did was abstract the logic of moving an entry from the store to the cache, because currently the user doesn't have an option to override it.

If you still believe this doesn't work, can you please elaborate on what exactly you propose? What code should we add and/or change in Ignite, and how will the user use it, API-wise?

-Val

On Wed, Nov 16, 2016 at 5:40 AM, Dmitriy Setrakyan <[hidden email]> wrote:

> On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov <[hidden email]> wrote:
>
> > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:
> > >
> > > > > Yakov, I agree that such scenario should be avoided. I also think that
> > > > > loadCache(...) method, as it is right now, provides a way to avoid it.
> > > >
> > > > No, it does not.
> > >
> > > Yes it does :)
> >
> > No it doesn't. Load cache should either send a query to DB that filters all
> > the data on server side which, in turn, may result to full-scan of 2 Tb
> > data set dozens of times (equal to node count) or send a query that brings
> > the whole dataset to each node which is unacceptable as well.
>
> Why not store the partition ID in the database and query only local
> partitions? Whatever approach we design with a DataStreamer will be slower
> than this.
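A minimal plain-Java sketch of the abstraction Val describes, under loud assumptions: it does not use `IgniteCache.localLoadCache` or `IgniteDataStreamer`, the cluster is modeled as a list of maps, and the names `keepLocalOnly` and `routeToOwner` are hypothetical. It shows how making the "move an entry from store to cache" step a pluggable closure lets one node read the database once and either keep its own keys or forward every entry to its owner.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Models a store load whose "put into cache" step is supplied by the caller. */
final class LoadClosureSketch {
    /** Hypothetical ownership rule (not Ignite's affinity function). */
    static int ownerOf(int key, int nodeCount) {
        return Math.floorMod(key, nodeCount);
    }

    /** Stand-in for the store: reads the DB once and hands each row to the closure. */
    static void loadFromStore(Map<Integer, String> db, BiConsumer<Integer, String> clo) {
        db.forEach(clo);
    }

    /** Closure resembling today's behavior: keep only entries the local node owns. */
    static BiConsumer<Integer, String> keepLocalOnly(int localNode, List<Map<Integer, String>> caches) {
        return (k, v) -> {
            if (ownerOf(k, caches.size()) == localNode)
                caches.get(localNode).put(k, v);
        };
    }

    /** Streamer-like closure: forward every entry to the node that owns it. */
    static BiConsumer<Integer, String> routeToOwner(List<Map<Integer, String>> caches) {
        return (k, v) -> caches.get(ownerOf(k, caches.size())).put(k, v);
    }

    /** Creates one empty per-node cache for each node in the toy cluster. */
    static List<Map<Integer, String>> newCluster(int nodeCount) {
        List<Map<Integer, String>> caches = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++)
            caches.add(new HashMap<>());
        return caches;
    }

    public static void main(String[] args) {
        Map<Integer, String> db = new HashMap<>();
        for (int k = 0; k < 6; k++)
            db.put(k, "v" + k);

        List<Map<Integer, String>> cluster = newCluster(3);
        loadFromStore(db, routeToOwner(cluster)); // a single pass over the DB fills all nodes
        System.out.println(cluster);
    }
}
```

Whether the forwarding variant beats per-node queries depends on the database and the network, which is the open question in the rest of the thread.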
In reply to this post by dsetrakyan
Dmitriy,
I am not fully confident that a partition ID is the best approach in all cases. Even if we have full access to the database structure, there are other problems.

Assume we have a table PERSON (ID NUMBER, NAME VARCHAR, SURNAME VARCHAR, AGE NUMBER, EMPL_DATE DATE), and we add our own column PART NUMBER.

While we already have indexes IDX1(NAME), IDX2(SURNAME), IDX3(AGE), IDX4(EMPL_DATE), we have to add a new two-column index IDX5(PART, EMPL_DATE) for pre-loading at startup, for example, recently employed persons.

And if we'd like to query filtered data from the database, we'd also have to create the other compound indexes IDX6(PART, NAME), IDX7(PART, SURNAME), IDX8(PART, AGE). So we double the overhead caused by indexes.

After these modifications to the database have been made and the PART column is filled, what should we do to preload the data?

We would have to perform as many database queries as there are partitions stored on the nodes. The number of queries would be 1024 with the default settings of the affinity functions. Some calls may not return any data at all, and each of those is a wasted network round trip. It may also be a problem for some databases to run that many parallel queries efficiently without degrading total throughput.

The DataStreamer approach may be faster, but it should be tested.

2016-11-16 16:40 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:

> On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov <[hidden email]> wrote:
>
> > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:
> > >
> > > > > Yakov, I agree that such scenario should be avoided. I also think that
> > > > > loadCache(...) method, as it is right now, provides a way to avoid it.
> > > >
> > > > No, it does not.
> > >
> > > Yes it does :)
> >
> > No it doesn't. Load cache should either send a query to DB that filters all
> > the data on server side which, in turn, may result to full-scan of 2 Tb
> > data set dozens of times (equal to node count) or send a query that brings
> > the whole dataset to each node which is unacceptable as well.
>
> Why not store the partition ID in the database and query only local
> partitions? Whatever approach we design with a DataStreamer will be slower
> than this.

--
Thanks,
Alexandr Kuramshin
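The round-trip concern above can be put in numbers with a small sketch. It assumes one query per partition, 1024 partitions, and a toy `key mod 1024` mapping; the class and method names are hypothetical and the mapping is not Ignite's affinity function. It counts how many of the per-partition queries would come back empty for a given set of keys.

```java
import java.util.BitSet;

/** Counts per-partition queries and how many of them return no rows. */
final class PartitionRoundTripSketch {
    /** Default partition count mentioned in the thread. */
    static final int PARTITIONS = 1024;

    /** Hypothetical key-to-partition mapping (not Ignite's affinity function). */
    static int partitionOf(int key) {
        return Math.floorMod(key, PARTITIONS);
    }

    /** Returns {queries issued, queries returning no rows}, one query per partition. */
    static int[] roundTrips(int[] keysInDb) {
        BitSet nonEmpty = new BitSet(PARTITIONS);
        for (int key : keysInDb)
            nonEmpty.set(partitionOf(key)); // mark partitions that hold at least one row
        return new int[] {PARTITIONS, PARTITIONS - nonEmpty.cardinality()};
    }

    public static void main(String[] args) {
        int[] keys = new int[100];
        for (int i = 0; i < keys.length; i++)
            keys[i] = i;
        int[] r = roundTrips(keys);
        System.out.println("queries: " + r[0] + ", empty: " + r[1]);
    }
}
```

For a small table most partitions are empty, so most of the 1024 queries are pure overhead; for a large, evenly distributed table the empty count drops toward zero and the per-query cost dominates instead.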
Alexandr,
This has been tested many times already by our users, and the answer is simple: it depends :) Any approach has its pros and cons, and you never know which one will be better for a particular use case, database, data model, hardware, etc.

Having said that, you will never find the best way to load the data, because it just doesn't exist. What I propose is just to make the API more generic and give users even more control than they have now.

-Val

On Fri, Nov 18, 2016 at 6:53 AM, Alexandr Kuramshin <[hidden email]> wrote:

> Dmitriy,
>
> I will not be fully confident that partition ID is the best approach in all
> cases. Even if we have full access to the database structure, there are
> another problems.
>
> Assume we have a table PERSON (ID NUMBER, NAME VARCHAR, SURNAME VARCHAR,
> AGE NUMBER, EMPL_DATE DATE). And we add our column PART NUMBER.
>
> While we already have indexes IDX1(NAME), IDX2(SURNAME), IDX3(AGE),
> IDX4(EMPL_DATE), we have to add new 2-column index IDX5(PART, EMPL_DATE)
> for pre-loading at startup, for example, recently employed persons.
>
> And if we'd like to query filtered data from the database, we'd also have
> to create the other compound indexes IDX6(PART, NAME), IDX7(PART, SURNAME),
> IDX8(PART, AGE). So we doubling overhead is defined by indexes.
>
> After this modifications on the database has been done and the PART column
> is filled, what we should do to preload the data?
>
> We should perform so many database queries so many partitions are stored on
> the nodes. Number of queries would be 1024 by default settings in the
> affinity functions. Some calls may not return any data at all, and it will
> be a vain network round-trip. Also it may be a problem for some databases
> to effectively perform number of parallel queries without a degradation on
> the total throughput.
>
> DataStreamer approach may be faster, but it should be tested.
>
> 2016-11-16 16:40 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:
>
> > On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov <[hidden email]> wrote:
> >
> > > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[hidden email]> wrote:
> > > >
> > > > > > Yakov, I agree that such scenario should be avoided. I also think that
> > > > > > loadCache(...) method, as it is right now, provides a way to avoid it.
> > > > >
> > > > > No, it does not.
> > > >
> > > > Yes it does :)
> > >
> > > No it doesn't. Load cache should either send a query to DB that filters all
> > > the data on server side which, in turn, may result to full-scan of 2 Tb
> > > data set dozens of times (equal to node count) or send a query that brings
> > > the whole dataset to each node which is unacceptable as well.
> >
> > Why not store the partition ID in the database and query only local
> > partitions? Whatever approach we design with a DataStreamer will be slower
> > than this.
>
> --
> Thanks,
> Alexandr Kuramshin
In reply to this post by dsetrakyan
> Why not store the partition ID in the database and query only local
> partitions? Whatever approach we design with a DataStreamer will be slower
> than this.

Because this can be some generic DB. Imagine the app migrating to IMDG.

I am pretty sure that in many cases the approach with a data streamer will be faster, and in many other cases the approach with multiple queries will be faster. The choice should depend on many factors. I like Val's suggestions. I think he is going in the right direction.

--Yakov
Guys,
I created a ticket for this: https://issues.apache.org/jira/browse/IGNITE-4255

Feel free to provide comments.

-Val

On Sat, Nov 19, 2016 at 6:56 AM, Yakov Zhdanov <[hidden email]> wrote:

> > Why not store the partition ID in the database and query only local
> > partitions? Whatever approach we design with a DataStreamer will be slower
> > than this.
>
> Because this can be some generic DB. Imagine the app migrating to IMDG.
>
> I am pretty sure that in many cases approach with data streamer will be
> faster and in many cases approach with multiple queries will be faster. And
> the choice should depend on many factors. I like Val's suggestions. I think
> he goes in the right direction.
>
> --Yakov
Val, Yakov,
Sorry for the delay, I needed time to think and to run some tests.

Anyway, extending the API and supplying a default implementation is good. It makes frameworks more flexible and usable. But your proposed extension will not solve the problem that I have raised.

Please read the following carefully.

The current implementation of IgniteCache.loadCache causes parallel execution of IgniteCache.localLoadCache on each node in the cluster. It's a bad implementation, but it has the *right semantics*.

You propose to extend IgniteCache.localLoadCache and use it to load data on all the nodes. That is bad semantics, and it also leads to a bad implementation. Here is why.

When you filter the data with the supplied IgniteBiPredicate, you may access data that must be co-located. Hence, to load the data to all the nodes from one node, you need access to all the related data partitioned across the cluster. This leads to great network overhead and overloads near caches.

That is also why I am surprised that the IgniteBiPredicate is executed for every key supplied by Cache.loadCache, and not only for the keys that will be stored on this node.

My opinion, in conclusion:

localLoadCache should first filter a key by the affinity function and the current cache topology, *then* invoke the predicate, and then store the entity in the cache (possibly by invoking the supplied closure). All associated partitions should be locked for the duration of the load.

IgniteCache.loadCache should perform Cache.loadCache on one node (or a few nodes), then transfer the entities to the remote nodes, and *then* invoke the predicate and the closure on the remote nodes.

2016-11-22 2:16 GMT+03:00 Valentin Kulichenko <[hidden email]>:

> Guys,
>
> I created a ticket for this:
> https://issues.apache.org/jira/browse/IGNITE-4255
>
> Feel free to provide comments.
>
> -Val
>
> On Sat, Nov 19, 2016 at 6:56 AM, Yakov Zhdanov <[hidden email]> wrote:
>
> > > Why not store the partition ID in the database and query only local
> > > partitions? Whatever approach we design with a DataStreamer will be
> > > slower than this.
> >
> > Because this can be some generic DB. Imagine the app migrating to IMDG.
> >
> > I am pretty sure that in many cases approach with data streamer will be
> > faster and in many cases approach with multiple queries will be faster. And
> > the choice should depend on many factors. I like Val's suggestions. I think
> > he goes in the right direction.
> >
> > --Yakov

--
Thanks,
Alexandr Kuramshin
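The ordering Alexandr argues for in localLoadCache (affinity check first, then the user predicate, then the store step) can be sketched in plain Java. This is an illustration under assumptions, not Ignite code: `ownerOf` is a toy `key mod nodeCount` rule, the caches are ordinary maps, partition locking is left out, and all names are hypothetical. The returned counter shows that the predicate runs only for keys the local node owns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

/** Applies the affinity check before the user predicate during a local load. */
final class AffinityFirstFilterSketch {
    /** Hypothetical ownership rule (not Ignite's affinity function). */
    static int ownerOf(int key, int nodeCount) {
        return Math.floorMod(key, nodeCount);
    }

    /**
     * Loads rows into localCache and returns how many times the user predicate was invoked.
     * Order of steps: 1) affinity check, 2) user predicate, 3) store in the local cache.
     */
    static int loadLocal(Map<Integer, String> rows, int localNode, int nodeCount,
                         BiPredicate<Integer, String> userFilter, Map<Integer, String> localCache) {
        int predicateCalls = 0;
        for (Map.Entry<Integer, String> e : rows.entrySet()) {
            if (ownerOf(e.getKey(), nodeCount) != localNode)
                continue; // step 1: skip keys owned by other nodes before any user code runs
            predicateCalls++;
            if (userFilter.test(e.getKey(), e.getValue())) // step 2: user predicate on local keys only
                localCache.put(e.getKey(), e.getValue()); // step 3: store the accepted entry
        }
        return predicateCalls;
    }

    public static void main(String[] args) {
        Map<Integer, String> rows = new HashMap<>();
        for (int k = 0; k < 12; k++)
            rows.put(k, "v" + k);

        Map<Integer, String> cache = new HashMap<>();
        int calls = loadLocal(rows, 1, 4, (k, v) -> k >= 5, cache);
        System.out.println("predicate calls: " + calls + ", cached keys: " + cache.keySet());
    }
}
```

With 12 rows and 4 nodes, node 1 evaluates the predicate for 3 keys instead of 12, so a predicate that touches co-located data never runs against keys whose related data lives elsewhere.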