Hello!
Each node has a local dictionary (per node currently, per cache planned). The dictionary is never shared between nodes. As data patterns shift, dictionary rotation is also planned.

With Zstd, the best dictionary size seems to be 1024 bytes. I imagine it is enough to store common BinaryObject boilerplate, and everything else is compressed on the fly. The source sample is 16k records.

Regards,
--
Ilya Kasnacheev

On Tue, Sep 4, 2018 at 11:49, Dmitriy Setrakyan <[hidden email]> wrote:

> On Tue, Sep 4, 2018 at 1:16 AM, Ilya Kasnacheev <[hidden email]> wrote:
>
> > Hello!
> >
> > The compression is per-binary-object, but the dictionary is external,
> > shared between multiple (millions of) entries and stored alongside the
> > compressed data.
>
> I was under a different impression. If the dictionary is for the whole data
> set, then it will occupy megabytes (if not gigabytes) of data. What happens
> when a new node joins and has no idea about the dictionary? What happens
> when dictionaries between nodes get out of sync?
>
> D.
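The shared-boilerplate idea above can be sketched in a few lines. Zstd is not in the Python standard library, so this toy uses zlib's preset-dictionary feature, which works on the same principle; the record shape and the dictionary contents are invented for illustration and are not Ignite's real BinaryObject format.

```python
import zlib

# Toy records: identical field-name boilerplate, differing payloads
# (a stand-in for BinaryObjects sharing one layout; not Ignite's format).
records = [
    b'{"typeId":1,"type":"user","fields":{"id":%d,"email":"u%d@example.com"}}'
    % (i, i)
    for i in range(100)
]

# A small preset dictionary of the common boilerplate, analogous to the
# 1024-byte Zstd dictionary discussed in this thread.
dictionary = b'{"typeId":1,"type":"user","fields":{"id":,"email":"@example.com"}}'

def compressed_size(data, zdict=None):
    # Compress one record independently, optionally with a preset dictionary.
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return len(c.compress(data) + c.flush())

plain = sum(compressed_size(r) for r in records)
with_dict = sum(compressed_size(r, dictionary) for r in records)
print("no dict:", plain, "with dict:", with_dict)
```

Even a dictionary far smaller than 1024 bytes shrinks each record here, because an individual record is too short for the compressor to discover the boilerplate on its own.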
On Tue, Sep 4, 2018 at 2:55 AM, Ilya Kasnacheev <[hidden email]> wrote:

> With Zstd, the best dictionary size seems to be 1024 bytes.

Thanks, Ilya, understood. I think per-cache is a better idea. However, I have a question about dictionary size. Ignite stores TBs of data. How do you plan the dictionary to fit in 1K bytes?

D.
Hello!
In the case of Apache Ignite, most of the savings come from the BinaryObject format, which encodes types and fields with byte sequences. Any enum/string flags will also land in the dictionary. And then, as it processes a record, the compressor fills up its individual dictionary.

But, in one cache, most if not all entries have an identical BinaryObject layout, so a tiny dictionary covers that case. Compression algorithms are not very keen on large dictionaries, preferring to work with local regularities in the byte stream.

E.g. if we have large entries in a cache with low BinaryObject overhead, they are served just fine by "generic" compression.

All of the above is my speculation, actually. I just observe that on a large data set the compression ratio is around 0.4 (2.5x) with a dictionary of 1024 bytes. The rest is a black box.

Regards,
--
Ilya Kasnacheev

On Tue, Sep 4, 2018 at 17:16, Dmitriy Setrakyan <[hidden email]> wrote:

> Thanks, Ilya, understood. I think per-cache is a better idea. However, I
> have a question about dictionary size. Ignite stores TBs of data. How do
> you plan the dictionary to fit in 1K bytes?
>
> D.
In my view, dictionary of 1024 bytes is not going to be nearly enough.
On Tue, Sep 4, 2018 at 8:06 AM, Ilya Kasnacheev <[hidden email]> wrote:

> All of the above is my speculation, actually. I just observe that on a
> large data set the compression ratio is around 0.4 (2.5x) with a
> dictionary of 1024 bytes.
Hello!
Of course, this setting will be configurable.

Regards,
--
Ilya Kasnacheev

On Wed, Sep 5, 2018 at 3:21, Dmitriy Setrakyan <[hidden email]> wrote:

> In my view, dictionary of 1024 bytes is not going to be nearly enough.
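Once the dictionary size is configurable, a natural way to pick a value is to sweep candidate sizes against a sample of records and compare compression ratios, much as the 1024-byte figure above was presumably arrived at. A toy harness under the same assumptions as before (zlib instead of Zstd, an invented record shape, and a truncated boilerplate slice standing in for a trained dictionary):

```python
import random
import zlib

random.seed(0)

# Invented records: ~70 bytes of shared boilerplate plus an incompressible
# payload, a rough stand-in for small BinaryObject entries.
BOILERPLATE = b'{"typeId":7,"schemaId":123,"fields":["id","name","email","created"]}'

def make_record(i):
    noise = bytes(random.randrange(256) for _ in range(64))
    return BOILERPLATE + b'#%d#' % i + noise

records = [make_record(i) for i in range(200)]
raw_size = sum(len(r) for r in records)

# Sweep dictionary sizes. Here the "trained" dictionary is just a truncated
# repetition of the boilerplate; a real Zstd dictionary would instead be
# trained on sample records.
ratios = {}
for size in (64, 256, 1024):
    zdict = (BOILERPLATE * 20)[:size]
    total = 0
    for r in records:
        c = zlib.compressobj(zdict=zdict)
        total += len(c.compress(r) + c.flush())
    ratios[size] = total / raw_size
    print(size, round(ratios[size], 2))
```

In this toy the larger dictionary is at best marginally better, which matches the observation above that beyond the shared boilerplate a bigger dictionary buys little; real numbers would of course have to come from sweeping Zstd over actual cache data.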