Apache Ignite Developers - Legacy Mail Archive

How much memory/disc I need to keep my data?

Anton Vinogradov-2

How much memory/disc I need to keep my data?

Igniters,

Since we're developing a storage system, it's interesting to know how
efficiently it stores data.

I propose developing an Estimator that calculates how much space is
needed to keep a given set of data.

For example:
1) You have classes A, B and C with known fields and a known data
distribution over these fields.
2) You know that you have to keep 1M instances of A, 2M of B and 45K of C.

We can approach the estimation in two different ways:

1) Estimate how much space is needed to keep the data in binary format.
To do so, we should
- Create some instances
- Marshal them to binary format
- Sum their sizes
- Scale the sum up to the expected number of instances
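
The four steps above can be sketched as follows. This is only an
illustration: compact JSON stands in for the marshalling step (it is not
BinaryMarshaller, and its sizes will differ), and `serialized_size`,
`estimate_total_bytes` and `make_a` are hypothetical names.

```python
import json
import random


def serialized_size(obj) -> int:
    # Stand-in for binary marshalling: compact JSON encoded as UTF-8.
    # A real estimator would measure the marshaller's own output instead.
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def estimate_total_bytes(make_instance, expected_count, sample_size=1000, seed=0):
    # Create sample instances, sum their serialized sizes, then scale the
    # average size up to the expected number of instances.
    rng = random.Random(seed)
    total = sum(serialized_size(make_instance(rng)) for _ in range(sample_size))
    return round(total / sample_size * expected_count)


def make_a(rng):
    # Hypothetical class A: an id plus a name of varying length.
    return {"id": rng.randrange(10**6), "name": "x" * rng.randint(5, 20)}


if __name__ == "__main__":
    print(estimate_total_bytes(make_a, expected_count=1_000_000))
```

The generator function is where the known data distribution goes: the
closer the sample is to real data, the closer the average size will be.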

Pros:
- Fast.
- No need to start Ignite nodes.
- Can be used as a kind of benchmarking tool for BinaryMarshaller: once
you improve something in BinaryMarshaller, you'll see the gain in the
BinarySizeEstimator results.

Cons:
- The estimate will differ from real cluster memory consumption and can
be used only as a preliminary assessment.

2) Estimate how much space is needed to keep the data in a real cluster.
To do so, we should
- Configure and start a small cluster. Set the page size, cache types and
their number, backups, node count, etc.
- Create a lot of instances (1/1000, 1/10 or even 1/1 of the expected
amount)
- Measure the total size of the allocated pages
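
One possible way to turn such measurements into a forecast, assuming
storage grows roughly linearly with the number of entries: load two
samples of different sizes, record the allocated page bytes for each, and
extrapolate along the line through the two points. The function name and
the numbers below are hypothetical; the measurements themselves would come
from the test cluster.

```python
def extrapolate_storage(small, large, target_count):
    # Each sample is a pair (entry_count, measured_bytes) from a test load.
    # Assumes a linear model: bytes = fixed_overhead + per_entry * count.
    (n1, b1), (n2, b2) = small, large
    if n1 == n2:
        raise ValueError("samples must have different entry counts")
    per_entry = (b2 - b1) / (n2 - n1)
    fixed_overhead = b1 - per_entry * n1
    return fixed_overhead + per_entry * target_count


if __name__ == "__main__":
    # Hypothetical measurements: 1K entries took 5 MB, 10K entries took 14 MB.
    print(extrapolate_storage((1_000, 5_000_000), (10_000, 14_000_000), 1_000_000))
```

Using two sample sizes rather than one separates the fixed cost of an
almost empty cluster from the per-entry cost, which a single 1/1000 sample
multiplied by 1000 would blur together.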

Pros:
- Can be used as a pre-production tuning tool.

Cons:
- Slow.
- Requires starting Ignite nodes and a lot of free memory.

I think we need both, but I propose to start with the first approach -
BinarySizeEstimator (https://issues.apache.org/jira/browse/IGNITE-6300)

Thoughts?

Vladimir Ozerov

Re: How much memory/disc I need to keep my data?

We should keep in mind two very important things. First, the binary object
format is not a storage format by design. Second, real space consumption
heavily depends on configuration (backups, page size, indexes, compression,
etc.). For this reason, by estimating the sizes of binary objects a user
would estimate nothing. This would only confuse users.

Thus I vote for the second solution. As for the cons, I find them
unconvincing. It is not slow: a node starts in a matter of seconds and data
load takes seconds to minutes. Nor does it require a lot of memory: a small
sample of data would be enough.
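
To illustrate the point about configuration, here is a toy model in which
the same amount of binary data yields different footprints depending on
backups and page size. It is not Ignite's actual storage layout:
`rough_cluster_bytes` is a hypothetical helper, `index_overhead` is a
made-up knob, and 4096 is just an illustrative page size.

```python
import math


def rough_cluster_bytes(binary_bytes, backups, page_size=4096, index_overhead=0.0):
    # Each backup is a full extra copy of the data.
    copies = 1 + backups
    # Assumption: indexes add a fixed fraction on top of the raw data.
    with_indexes = binary_bytes * (1.0 + index_overhead)
    # Storage is allocated in whole pages, so round up to a page boundary.
    pages = math.ceil(with_indexes * copies / page_size)
    return pages * page_size


if __name__ == "__main__":
    for backups in (0, 1, 2):
        print(backups, rough_cluster_bytes(1_000_000, backups))
```

Even this crude model moves the result by integer multiples, which is why
a number derived from object sizes alone says little about a real cluster.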

In reply to Anton Vinogradov's message of Tue, Sep 12, 2017 at 7:29 PM.

Vladimir Ozerov

Re: How much memory/disc I need to keep my data?

Another important point is that size estimation is typically used to
estimate costs. We are talking about our users' real money. Inaccurate
estimates might lead to wrong decisions and lost money. We should take
this feature very seriously.

In reply to Vladimir Ozerov's message of Tue, Sep 12, 2017 at 7:40 PM.

yzhdanov

Re: How much memory/disc I need to keep my data?

Agree with Vladimir - the second option seems to be more interesting.

Guys, can we also give users recommendations on building a more effective
data model? For example, can we detect dates stored as strings, indexes on
boolean fields (which most probably have very low selectivity), or an
indexed field that is null for most rows (so queries with a "field = null"
condition would be cheaper to run as a scan)?
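
A minimal sketch of what such checks could look like over a sample of one
field's values. Everything here is an assumption for illustration: the
function names, the hint labels, the two date formats and the thresholds
are hypothetical, not an existing Ignite feature.

```python
from datetime import datetime

# Assumed date formats; a real tool would need a broader, configurable set.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S")


def looks_like_date(value):
    # True if the value is a string that parses with one of the formats.
    if not isinstance(value, str):
        return False
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def column_hints(values, indexed=False, min_selectivity=0.01, max_null_fraction=0.9):
    # Inspect sampled values of one field (values must be hashable).
    hints = []
    if not values:
        return hints
    non_null = [v for v in values if v is not None]
    null_fraction = 1 - len(non_null) / len(values)
    if non_null and all(looks_like_date(v) for v in non_null):
        hints.append("date_in_string")
    if indexed and null_fraction >= max_null_fraction:
        hints.append("mostly_null_index")
    # Selectivity here is distinct values divided by non-null values.
    if indexed and non_null and len(set(non_null)) / len(non_null) < min_selectivity:
        hints.append("low_selectivity_index")
    return hints


if __name__ == "__main__":
    print(column_hints(["2017-09-12", "2017-09-13"]))
    print(column_hints([True, False] * 500, indexed=True))
    print(column_hints([None] * 95 + [1, 2, 3, 4, 5], indexed=True))
```

Since approach 2 loads a data sample anyway, checks like these could run
over the same sample at little extra cost.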

--Yakov