Igniters,

Since we're developing a storage system, it's worth knowing how efficiently it stores data.

I propose developing an Estimator that calculates how much space is needed to store a given data set.

For example:
1) You have classes A, B and C with known fields and a known data distribution over these fields.
2) You know that you have to store 1M instances of A, 2M of B and 45K of C.

We can perform the estimation using two different approaches:

1) Estimate how much space is needed to store the data in binary format.
So, we should:
- Create some instances
- Marshal them to binary format
- Sum their sizes
- Multiply to scale up to the expected number of instances

Pros:
- Fast.
- No need to start Ignite nodes.
- Can be used as a benchmarking tool for BinaryMarshaller. Once you improve something in BinaryMarshaller, you'll see the gain in the BinarySizeEstimator results.

Cons:
- The estimate will differ from real cluster memory consumption and can be used only as a preliminary assessment.

2) Estimate how much space is needed to store the data in a real cluster.
So, we should:
- Configure and start a small cluster. Set page size, cache types and their number, backups, node count, etc.
- Create a lot of instances (1/1000, 1/10 or even 1/1 of the expected number)
- Count the total size of the pages

Pros:
- Can be used as a pre-production tuning tool.

Cons:
- Slow.
- Requires starting Ignite nodes and a lot of free memory.

I think we need both, but I propose to start with the first approach - BinarySizeEstimator (https://issues.apache.org/jira/browse/IGNITE-6300)

Thoughts?

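The arithmetic behind the first approach (create a sample, measure each instance, sum, scale up) can be sketched as below. This is only an illustration under stated assumptions: the `Person` class and `payloadBytes` are hypothetical stand-ins, and `payloadBytes` counts raw field bytes only. It does not call BinaryMarshaller, whose output also carries headers and metadata, so real marshalled sizes would be larger.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.ToLongFunction;

/** Sketch of the "measure a sample, sum the sizes, multiply" arithmetic. */
final class BinarySizeEstimateSketch {
    /** Hypothetical value class standing in for "class A" from the proposal. */
    static final class Person {
        final long id;
        final String name;

        Person(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /**
     * Stand-in for marshalling (an assumption, not the Ignite API): counts
     * 8 bytes for the long plus the UTF-8 length of the name.
     */
    static long payloadBytes(Person p) {
        return Long.BYTES + p.name.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Average sampled size multiplied by the expected number of instances. */
    static <T> long estimateTotalBytes(List<T> sample, ToLongFunction<T> sizeOf, long expectedCount) {
        if (sample.isEmpty())
            throw new IllegalArgumentException("sample must not be empty");

        long sum = 0;
        for (T item : sample)
            sum += sizeOf.applyAsLong(item);

        double avg = (double) sum / sample.size();
        return Math.round(avg * expectedCount);
    }

    public static void main(String[] args) {
        List<Person> sample = List.of(new Person(1, "Ann"), new Person(2, "Robert"));
        long bytes = estimateTotalBytes(sample, BinarySizeEstimateSketch::payloadBytes, 1_000_000L);
        System.out.println("Estimated payload bytes for 1M instances: " + bytes);
    }
}
```

The size function is passed in as a parameter so that a real marshaller call could replace the stand-in without touching the scaling logic. How representative the result is depends entirely on how well the sample follows the real data distribution.
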
We should keep in mind two very important things. First, the binary object format is not a storage format by design. Second, real space consumption heavily depends on configuration (backups, page size, indexes, compression, etc.). For this reason, by estimating the sizes of binary objects the user would estimate nothing. This would only confuse users.

Thus I vote for the second solution. As for the cons, I find them unconvincing. It is not slow (a node starts in a matter of seconds, and data load takes seconds to minutes), nor does it require a lot of memory (a small sample of data would be enough).

On Tue, Sep 12, 2017 at 7:29 PM, Anton Vinogradov wrote:
> I think we need both, but I propose to start with first approach -
> BinarySizeEstimator (https://issues.apache.org/jira/browse/IGNITE-6300)

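For the second approach, the extrapolation from a small sample loaded into a real cluster might look like the sketch below. It assumes the page counts before and after the load are read from the node by some means (the sketch takes them as plain numbers and calls no Ignite API), and it assumes consumption grows linearly with the number of entries, which need not hold for indexes or partially filled pages. All names here are hypothetical.

```java
/** Sketch: scale the page growth observed for a sample up to the full data set. */
final class PageBasedEstimateSketch {
    /**
     * @param pagesBeforeLoad pages allocated before the sample was loaded
     * @param pagesAfterLoad  pages allocated after the sample was loaded
     * @param pageSizeBytes   configured page size in bytes
     * @param sampleCount     number of entries in the loaded sample
     * @param expectedCount   number of entries expected in production
     * @return estimated bytes for expectedCount entries (linear extrapolation)
     */
    static long estimateBytes(long pagesBeforeLoad, long pagesAfterLoad, int pageSizeBytes,
                              long sampleCount, long expectedCount) {
        if (sampleCount <= 0)
            throw new IllegalArgumentException("sampleCount must be positive");
        if (pagesAfterLoad < pagesBeforeLoad)
            throw new IllegalArgumentException("page count must not shrink during load");

        // Bytes attributable to the sample: page growth times page size.
        long sampleBytes = (pagesAfterLoad - pagesBeforeLoad) * pageSizeBytes;
        double bytesPerEntry = (double) sampleBytes / sampleCount;

        return Math.round(bytesPerEntry * expectedCount);
    }

    public static void main(String[] args) {
        // Made-up numbers: 1,000 new 4 KB pages for a 10,000-entry sample.
        long bytes = estimateBytes(100, 1_100, 4096, 10_000, 1_000_000);
        System.out.println("Estimated bytes for 1M entries: " + bytes);
    }
}
```

Because the sample is loaded into a cluster that already has backups, indexes and page size configured, those factors are reflected in the measured page growth instead of being modelled separately.
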
Another important point is that size estimation is typically used to estimate costs. We are talking about our users' real money. Inaccurate estimates might lead to wrong decisions and lost money. We should take this feature very seriously.

On Tue, Sep 12, 2017 at 7:40 PM, Vladimir Ozerov wrote:
> For this reason by estimating sizes of binary objects user would
> estimate nothing. This would only confuse users.

Agree with Vladimir - the second option seems to be more interesting.

Guys, can we also give users recommendations on building a more efficient data model? For example, can we detect dates stored in strings, indexes on boolean fields (which most probably have very low selectivity), or an indexed field that is null for most rows (so queries with a "field = null" condition would be cheaper to run as a scan)?

--Yakov

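The checks suggested above could be prototyped over a sample of column values roughly as follows. This is a hedged sketch: the class and method names are hypothetical, selectivity is taken here as distinct values divided by total rows, and only ISO-8601 dates (such as 2017-09-12) are recognised. Which thresholds should trigger a recommendation is left open.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;

/** Sketch of simple per-column statistics that could back data model hints. */
final class DataModelHintsSketch {
    /** Fraction of rows where the value is null (0.0 for an empty sample). */
    static double nullFraction(List<?> values) {
        if (values.isEmpty())
            return 0.0;
        long nulls = values.stream().filter(Objects::isNull).count();
        return (double) nulls / values.size();
    }

    /** Distinct values divided by total rows; null counts as one distinct value. */
    static double selectivity(List<?> values) {
        if (values.isEmpty())
            return 0.0;
        return (double) new HashSet<Object>(values).size() / values.size();
    }

    /** True if the string parses as an ISO-8601 local date. */
    static boolean looksLikeIsoDate(String s) {
        if (s == null)
            return false;
        try {
            LocalDate.parse(s);
            return true;
        }
        catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<Boolean> flags = Arrays.asList(true, false, true, true, false, true, true, true);
        List<String> city = Arrays.asList(null, null, null, "Oslo");

        System.out.println("boolean selectivity: " + selectivity(flags));
        System.out.println("null fraction: " + nullFraction(city));
        System.out.println("date in string: " + looksLikeIsoDate("2017-09-12"));
    }
}
```

On a large sample, a boolean column's selectivity approaches zero and a mostly-null column's null fraction approaches one, which is what would flag an index on either as a candidate for review.
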