Hey folks,
While working with Ignite users, I keep seeing data models where a single object (row) contains many fields (100, 200, or more), and most of them are strings.

Correct me if I'm wrong, but as I understand it, for every such field we store an integer value to represent its length. This is significant overhead: with 200 fields we spend 800 bytes on lengths alone.

Now here is the catch: the vast majority of those strings are actually empty or very short (a few characters), so we don't really need 4 bytes to encode their length.

My suggestion is to introduce another data type, e.g. STRING_SHORT, use it for all strings that are 255 chars or less, and thus use a single byte to encode the length. We can go even further and also introduce STRING_EMPTY, which obviously doesn't need any length information at all.

What do you guys think?

-Val
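To make the arithmetic in the proposal concrete, here is a minimal sketch in Python. It is not Ignite code: the type codes and the "type byte + length + UTF-8 bytes" layout are assumptions made up for the illustration, and the 255 limit is applied to the encoded byte length rather than the character count. The row of 150 empty strings and 50 three-character strings is an invented example, not measured data.

```python
import struct

# Hypothetical type codes for this sketch only; not Ignite's real codes.
STRING = 0x01
STRING_SHORT = 0x02
STRING_EMPTY = 0x03


def encode_current(s: str) -> bytes:
    """Assumed current layout: 1 type byte + 4-byte length + UTF-8 data."""
    data = s.encode("utf-8")
    return struct.pack("<Bi", STRING, len(data)) + data


def encode_proposed(s: str) -> bytes:
    """Proposed layout: no length for empty strings, 1-byte length up to 255."""
    data = s.encode("utf-8")
    if not data:
        return struct.pack("<B", STRING_EMPTY)
    if len(data) <= 255:
        return struct.pack("<BB", STRING_SHORT, len(data)) + data
    # Longer strings fall back to the 4-byte length.
    return struct.pack("<Bi", STRING, len(data)) + data


# Illustrative row: 200 string fields, most of them empty.
row = [""] * 150 + ["abc"] * 50

current_size = sum(len(encode_current(s)) for s in row)
proposed_size = sum(len(encode_proposed(s)) for s in row)

print(current_size, proposed_size)  # 1150 400
```

In this made-up row the 4-byte lengths account for exactly the 800 bytes mentioned above; the proposed encoding spends 50 bytes on lengths instead.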
Hi Val,
I would say that we do not need the string length at all, because it can be derived from the object footer (next field offset minus current field offset).

It is not a good idea to implement the proposed change in Apache Ignite 2.x, because it is a breaking change and would add unnecessary complexity to an already very complex binary infrastructure. Instead, it would be better to review the binary format in 3.0 and remove lengths not only from strings, but from other variable-length data types as well (arrays, decimals).

On Tue, Mar 5, 2019 at 10:12 AM Valentin Kulichenko <[hidden email]> wrote:
> [...]
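The offset-based idea can be sketched as follows. This is a simplified illustration in Python, not the actual Ignite binary layout: it assumes field data is stored back to back with no length prefixes, and that a footer (here just a Python list named `offsets`) holds the start offset of every field. The helper names are hypothetical.

```python
def pack_fields(values):
    """Concatenate UTF-8 field data and record each field's start offset."""
    body = bytearray()
    offsets = []  # stands in for the object footer
    for value in values:
        offsets.append(len(body))
        body += value.encode("utf-8")
    return bytes(body), offsets


def read_field(body, offsets, index):
    """Recover a field using only offsets: length = next offset - current offset."""
    start = offsets[index]
    # The last field ends where the field data ends.
    end = offsets[index + 1] if index + 1 < len(offsets) else len(body)
    return body[start:end].decode("utf-8")


body, offsets = pack_fields(["abc", "", "hello"])
print(offsets)                       # [0, 3, 3]
print(read_field(body, offsets, 2))  # hello
```

An empty string costs nothing in the field data here: it shows up as two equal consecutive offsets.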
Hello!
If you can modify your code to store nulls instead of empty strings, nulls seem to be much more compact.

Regards,
--
Ilya Kasnacheev
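A rough sketch of why nulls can be smaller, again in Python and under assumed layouts: a null is taken to be a single type byte, while a string is a type byte plus a 4-byte length plus its data. The tag values and the `empty_to_none` helper are hypothetical and exist only for this illustration.

```python
import struct

# Hypothetical tags for this sketch only.
NULL_TAG = 0x00
STRING_TAG = 0x01


def encode_field(value):
    """Assumed layout: null is 1 byte; a string is 1 + 4 + len(data) bytes."""
    if value is None:
        return struct.pack("<B", NULL_TAG)
    data = value.encode("utf-8")
    return struct.pack("<Bi", STRING_TAG, len(data)) + data


def empty_to_none(row):
    """Application-side normalization: store None instead of empty strings."""
    return [None if value == "" else value for value in row]


row = [""] * 200
size_with_empty = sum(len(encode_field(v)) for v in row)
size_with_null = sum(len(encode_field(v)) for v in empty_to_none(row))

print(size_with_empty, size_with_null)  # 1000 200
```

The trade-off is that code reading the data then has to treat a null field as an empty string wherever that distinction matters.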