Hello!
Each node has a local dictionary (per node currently, per cache planned). The dictionary is never shared between nodes. As data patterns shift, dictionary rotation is also planned.

With Zstd, the best dictionary size seems to be 1024 bytes. I imagine it is enough to store common BinaryObject boilerplate, and everything else is compressed on the fly. The source sample is 16k records.

Regards,
--
Ilya Kasnacheev

On Tue, Sep 4, 2018 at 11:49, Dmitriy Setrakyan <[hidden email]> wrote:

> On Tue, Sep 4, 2018 at 1:16 AM, Ilya Kasnacheev <[hidden email]> wrote:
>
> > Hello!
> >
> > The compression is per-binary-object, but the dictionary is external,
> > shared between multiple (millions of) entries and stored alongside the
> > compressed data.
>
> I was under a different impression. If the dictionary is for the whole data
> set, then it will occupy megabytes (if not gigabytes) of data. What happens
> when a new node joins and has no idea about the dictionary? What happens
> when dictionaries between nodes get out of sync?
>
> D.
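The shared-boilerplate idea above can be sketched in a few lines. Zstd is not in the Python standard library, so this toy uses zlib's preset-dictionary feature, which works on the same principle; the record shape and the dictionary contents are invented for illustration and are not Ignite's real BinaryObject format.

```python
import zlib

# Toy records: identical field-name boilerplate, differing payloads
# (a stand-in for BinaryObjects sharing one layout; not Ignite's format).
records = [
    b'{"typeId":1,"type":"user","fields":{"id":%d,"email":"u%d@example.com"}}'
    % (i, i)
    for i in range(100)
]

# A small preset dictionary of the common boilerplate, analogous to the
# 1024-byte Zstd dictionary discussed in this thread.
dictionary = b'{"typeId":1,"type":"user","fields":{"id":,"email":"@example.com"}}'

def compressed_size(data, zdict=None):
    # Compress one record independently, optionally with a preset dictionary.
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return len(c.compress(data) + c.flush())

plain = sum(compressed_size(r) for r in records)
with_dict = sum(compressed_size(r, dictionary) for r in records)
print("no dict:", plain, "with dict:", with_dict)
```

Even a dictionary far smaller than 1024 bytes shrinks each record here, because an individual record is too short for the compressor to discover the boilerplate on its own.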
On Tue, Sep 4, 2018 at 2:55 AM, Ilya Kasnacheev <[hidden email]> wrote:

> With Zstd, the best dictionary size seems to be 1024 bytes.

Thanks, Ilya, understood. I think per-cache is a better idea. However, I have a question about dictionary size. Ignite stores TBs of data. How do you plan the dictionary to fit in 1K bytes?

D.
Hello!
In the case of Apache Ignite, most of the savings come from the BinaryObject format, which encodes types and fields with byte sequences. Any enum/string flags will also land in the dictionary. And then, as it processes a record, the compressor fills up its individual dictionary.

But, in one cache, most if not all entries have an identical BinaryObject layout, so a tiny dictionary covers that case. Compression algorithms are not very keen on large dictionaries, preferring to work with local regularities in the byte stream.

E.g. if we have large entries in a cache with low BinaryObject overhead, they are served just fine by "generic" compression.

All of the above is my speculation, actually. I just observe that on a large data set the compression ratio is around 0.4 (2.5x) with a dictionary of 1024 bytes. The rest is a black box.

Regards,
--
Ilya Kasnacheev

On Tue, Sep 4, 2018 at 17:16, Dmitriy Setrakyan <[hidden email]> wrote:

> Thanks, Ilya, understood. I think per-cache is a better idea. However, I
> have a question about dictionary size. Ignite stores TBs of data. How do
> you plan the dictionary to fit in 1K bytes?
>
> D.
In my view, dictionary of 1024 bytes is not going to be nearly enough.
On Tue, Sep 4, 2018 at 8:06 AM, Ilya Kasnacheev <[hidden email]> wrote:

> All of the above is my speculation, actually. I just observe that on a
> large data set the compression ratio is around 0.4 (2.5x) with a
> dictionary of 1024 bytes.
Hello!
Of course, this setting will be configurable.

Regards,
--
Ilya Kasnacheev

On Wed, Sep 5, 2018 at 3:21, Dmitriy Setrakyan <[hidden email]> wrote:

> In my view, dictionary of 1024 bytes is not going to be nearly enough.
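Once the dictionary size is configurable, a natural way to pick a value is to sweep candidate sizes against a sample of records and compare compression ratios, much as the 1024-byte figure above was presumably arrived at. A toy harness under the same assumptions as before (zlib instead of Zstd, an invented record shape, and a truncated boilerplate slice standing in for a trained dictionary):

```python
import random
import zlib

random.seed(0)

# Invented records: ~70 bytes of shared boilerplate plus an incompressible
# payload, a rough stand-in for small BinaryObject entries.
BOILERPLATE = b'{"typeId":7,"schemaId":123,"fields":["id","name","email","created"]}'

def make_record(i):
    noise = bytes(random.randrange(256) for _ in range(64))
    return BOILERPLATE + b'#%d#' % i + noise

records = [make_record(i) for i in range(200)]
raw_size = sum(len(r) for r in records)

# Sweep dictionary sizes. Here the "trained" dictionary is just a truncated
# repetition of the boilerplate; a real Zstd dictionary would instead be
# trained on sample records.
ratios = {}
for size in (64, 256, 1024):
    zdict = (BOILERPLATE * 20)[:size]
    total = 0
    for r in records:
        c = zlib.compressobj(zdict=zdict)
        total += len(c.compress(r) + c.flush())
    ratios[size] = total / raw_size
    print(size, round(ratios[size], 2))
```

In this toy the larger dictionary is at best marginally better, which matches the observation above that beyond the shared boilerplate a bigger dictionary buys little; real numbers would of course have to come from sweeping Zstd over actual cache data.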