Apache Ignite Developers - Legacy Mail Archive

Compression prototype

Ilya Kasnacheev

Compression prototype

Hello!

I want to share my compression prototype with the developer community.

Long story short, it compresses each BinaryObject's byte[] as it is written
to a Durable Memory page, using a pre-built dictionary. The typical
compression ratio is 0.4 (i.e. 2.5x compression) with a custom LZW+Huffman
scheme. Metadata, indexes and primitive values are not affected at all.

This is akin to DB2's table-level compression[1] but independently invented.

On Yardstick tests the performance hit is 6% with PDS and up to 25% (in
throughput) with in-memory loads. It also means you can fit roughly twice as
much data into the same in-memory cluster, or have a higher RAM/disk ratio
with a PDS cluster, saving on hardware or decreasing latency.
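
To make the arithmetic behind these figures explicit, here is a minimal sketch. The class name and the "compressible share" parameter are assumptions made for illustration; the thread itself only gives the 0.4 ratio and the "~twice as much" estimate.

```java
/** Back-of-the-envelope arithmetic for the figures quoted above (illustrative only). */
final class CompressionMath {
    /** Size reduction factor for a compressed/original ratio, e.g. 0.4 gives 2.5. */
    static double reductionFactor(double ratio) {
        if (ratio <= 0 || ratio > 1)
            throw new IllegalArgumentException("ratio must be in (0, 1]: " + ratio);
        return 1.0 / ratio;
    }

    /**
     * Overall capacity gain when only part of the stored bytes is compressed.
     * compressibleShare is a hypothetical input: the fraction of bytes that are
     * BinaryObject payload rather than metadata, indexes or primitive values.
     */
    static double capacityGain(double ratio, double compressibleShare) {
        if (compressibleShare < 0 || compressibleShare > 1)
            throw new IllegalArgumentException("share must be in [0, 1]: " + compressibleShare);
        // Uncompressed part keeps its size, compressed part shrinks by the ratio.
        double sizeAfter = (1.0 - compressibleShare) + compressibleShare * reductionCheck(ratio);
        return 1.0 / sizeAfter;
    }

    /** Validates the ratio and returns it unchanged. */
    private static double reductionCheck(double ratio) {
        reductionFactor(ratio);
        return ratio;
    }

    public static void main(String[] args) {
        System.out.printf("reduction factor at ratio 0.4: %.2fx%n", reductionFactor(0.4));
        System.out.printf("capacity gain with 80%% compressible data: %.2fx%n",
            capacityGain(0.4, 0.8));
    }
}
```

Because metadata, indexes and primitive values stay uncompressed, the overall gain is lower than the 2.5x payload figure, which is consistent with the "roughly twice as much data" estimate.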

The code is available as PR 4295[2] (set IGNITE_ENABLE_COMPRESSION=true to
activate). Note that it will not presently survive a PDS node restart.
The code impact is very small, so the patch should be applicable to most 2.x
releases.

Sure, there's a long way to go before this prototype can hope to be
included, but first I would like to hear input from fellow Igniters.

See also IEP-20[3].

1. https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html
2. https://github.com/apache/ignite/pull/4295
3. https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite

Regards,

--
Ilya Kasnacheev

Dmitriy Pavlov

Re: Compression prototype

Hi Ilya,

Thank you for sharing this here. I believe this contribution will be
accepted by the Community. Moreover, it shows such a remarkable performance
boost.

I'm pretty sure this patch will be reviewed by Ignite Native Persistence
experts soon.

What do you mean by "will not survive a PDS node restart"?

Sincerely,
Dmitriy Pavlov

On Thu, 23 Aug 2018 at 13:09, Ilya Kasnacheev <[hidden email]> wrote:

Ilya Kasnacheev

Re: Compression prototype

Hello!

Currently, the dictionary for decompression is only stored on heap. After
restart there's compressed data in the PDS, but there's no dictionary :)
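
The failure mode described here can be reproduced in miniature. The sketch below is not the prototype's LZW+Huffman code: it uses the JDK's `java.util.zip.Deflater` and `Inflater` with a zlib preset dictionary as a stand-in, and the class and method names are made up for illustration. Passing `null` as the dictionary models a node that restarted and lost its on-heap dictionary.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Sketch: data compressed against a dictionary is unreadable once the dictionary is gone. */
final class DictionaryRestartDemo {
    /** Compresses data using a zlib preset dictionary (stand-in for the prototype's dictionary). */
    static byte[] compress(byte[] data, byte[] dict) {
        Deflater deflater = new Deflater();
        try {
            deflater.setDictionary(dict);
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished())
                out.write(buf, 0, deflater.deflate(buf));
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decompresses; dict may be null to model a node that lost its on-heap dictionary. */
    static byte[] decompress(byte[] compressed, byte[] dict) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                out.write(buf, 0, n);
                if (n == 0 && !inflater.finished()) {
                    if (inflater.needsDictionary()) {
                        if (dict == null)
                            throw new IllegalStateException("compressed data found, but no dictionary");
                        inflater.setDictionary(dict);
                    } else if (inflater.needsInput()) {
                        throw new IllegalArgumentException("truncated compressed data");
                    }
                }
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] dict = "{\"firstName\":\"\",\"lastName\":\"\",\"city\":\"\"}"
            .getBytes(StandardCharsets.UTF_8);
        byte[] value = "{\"firstName\":\"Ann\",\"lastName\":\"Lee\",\"city\":\"Oslo\"}"
            .getBytes(StandardCharsets.UTF_8);

        byte[] stored = compress(value, dict);
        System.out.println("round trip ok: " + Arrays.equals(value, decompress(stored, dict)));

        try {
            decompress(stored, null); // "after restart": the dictionary is gone
        } catch (IllegalStateException e) {
            System.out.println("after restart: " + e.getMessage());
        }
    }
}
```

Any fix along these lines has to make the dictionary at least as durable as the pages that were compressed with it.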

Regards,

--
Ilya Kasnacheev

2018-08-23 14:58 GMT+03:00 Dmitriy Pavlov <[hidden email]>:

Sergey Kozlov

Re: Compression prototype

In reply to this post by Ilya Kasnacheev

Hi Ilya

Is there a plan to expose it as an Ignite configuration option? If so,
instead of a boolean I suggest using an enum, to reserve the ability to add
compression algorithms in the future.
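
A minimal sketch of what such an enum could look like. Everything here is hypothetical: the type name, the constant names and the parsing helper are illustrative and are not taken from PR 4295 or from any Ignite API.

```java
import java.util.Locale;

/** Sketch of an enum-valued compression option replacing an on/off flag. */
final class CompressionModeDemo {
    /** Hypothetical setting; constant names are illustrative, not from PR 4295. */
    enum CompressionMode {
        DISABLED,
        DICTIONARY // further algorithms can be added later without changing the option's type
    }

    /** Maps the prototype-style boolean flag onto the enum. */
    static CompressionMode fromBoolean(boolean enabled) {
        return enabled ? CompressionMode.DICTIONARY : CompressionMode.DISABLED;
    }

    /** Parses a textual setting; null or blank means compression stays off. */
    static CompressionMode parse(String value) {
        if (value == null || value.trim().isEmpty())
            return CompressionMode.DISABLED;
        // valueOf throws IllegalArgumentException for unknown names.
        return CompressionMode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(fromBoolean(true));
        System.out.println(parse("dictionary"));
        System.out.println(parse(null));
    }
}
```

The point of the enum is that a later algorithm becomes a new constant rather than a second boolean.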

On Thu, Aug 23, 2018 at 1:09 PM, Ilya Kasnacheev <[hidden email]>
wrote:

--
Sergey Kozlov
GridGain Systems
www.gridgain.com

Dmitriy Pavlov

Re: Compression prototype

In reply to this post by Ilya Kasnacheev

OK, thanks. IMO the dictionary needs to be stored in Durable Memory before
this is merged into master.

On Thu, 23 Aug 2018 at 15:12, Ilya Kasnacheev <[hidden email]> wrote:

Ilya Kasnacheev

Re: Compression prototype

In reply to this post by Sergey Kozlov

Hello!

My plan was to add a compression section to the cache configuration, where
you can enable compression, enable key compression (which has heavier
performance implications), adjust dictionary gathering settings, and in the
future possibly choose between algorithms. In fact I'm not sure about the
last one, since my assumption is that you can always just use the latest and
greatest, but maybe we could offer, e.g., a very fast but weaker algorithm
vs. a slower but stronger one.

I'm not sure yet whether we should share one dictionary between all caches
or have a separate dictionary for every cache.

With regard to the data format, of course there will be room for further
extension.
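
The shape of such a configuration section might be sketched as follows. This is only an illustration of the options listed above: the class, its fields, the default values and both enums are hypothetical and do not correspond to anything in PR 4295 or in Ignite's CacheConfiguration.

```java
/** Hypothetical per-cache compression section; all names and defaults are illustrative. */
final class CacheCompressionSettings {
    /** Possible future choice: very fast but weaker vs. slower but stronger. */
    enum Algorithm { FAST, STRONG }

    /** Open question from the thread: one dictionary for all caches or one per cache. */
    enum DictionaryScope { SHARED, PER_CACHE }

    private boolean enabled;
    private boolean compressKeys;
    private int dictionarySampleSize = 10_000; // assumed "dictionary gathering" knob
    private Algorithm algorithm = Algorithm.FAST;
    private DictionaryScope dictionaryScope = DictionaryScope.PER_CACHE;

    CacheCompressionSettings setEnabled(boolean enabled) {
        this.enabled = enabled;
        return this;
    }

    CacheCompressionSettings setCompressKeys(boolean compressKeys) {
        this.compressKeys = compressKeys;
        return this;
    }

    CacheCompressionSettings setDictionarySampleSize(int size) {
        if (size <= 0)
            throw new IllegalArgumentException("sample size must be positive: " + size);
        this.dictionarySampleSize = size;
        return this;
    }

    CacheCompressionSettings setAlgorithm(Algorithm algorithm) {
        this.algorithm = algorithm;
        return this;
    }

    CacheCompressionSettings setDictionaryScope(DictionaryScope scope) {
        this.dictionaryScope = scope;
        return this;
    }

    boolean isEnabled() { return enabled; }
    boolean isCompressKeys() { return compressKeys; }
    int getDictionarySampleSize() { return dictionarySampleSize; }
    Algorithm getAlgorithm() { return algorithm; }
    DictionaryScope getDictionaryScope() { return dictionaryScope; }

    /** Key compression only makes sense when compression itself is enabled. */
    void validate() {
        if (compressKeys && !enabled)
            throw new IllegalStateException("key compression requires compression to be enabled");
    }

    public static void main(String[] args) {
        CacheCompressionSettings settings = new CacheCompressionSettings()
            .setEnabled(true)
            .setCompressKeys(false)
            .setDictionarySampleSize(50_000);
        settings.validate();
        System.out.println(settings.getAlgorithm() + " / " + settings.getDictionaryScope());
    }
}
```

Keeping the dictionary scope as an explicit setting leaves the shared vs. per-cache question open to benchmarking instead of fixing it in the data format.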

Regards,

--
Ilya Kasnacheev

2018-08-23 15:13 GMT+03:00 Sergey Kozlov <[hidden email]>:

dmagda

Re: Compression prototype

Hi Ilya,

Sounds terrific! Is this part of the following Ignite enhancement proposal?
https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite

--
Denis

On Thu, Aug 23, 2018 at 5:17 AM Ilya Kasnacheev <[hidden email]>
wrote:

Ilya Kasnacheev

Re: Compression prototype

Hello!

It is, in a sense, part of IEP-20, since I have updated the IEP with this
particular direction.

Regards,

--
Ilya Kasnacheev

2018-08-24 2:56 GMT+03:00 Denis Magda <[hidden email]>:

dmagda

Re: Compression prototype

>
> Currently, the dictionary for decompression is only stored on heap. After
> restart there's compressed data in the PDS, but there's no dictionary :)

Basically, it means that I've lost my data, right? How about persisting
data to disk?

Overall, we need Vladimir Ozerov to check the contribution. He was the one
who sponsored the IEP and knows the area best.

--
Denis

On Fri, Aug 24, 2018 at 4:31 AM Ilya Kasnacheev <[hidden email]>
wrote:

daradurvs

Re: Compression prototype

Hi Igniters!

Ilya, I'm glad to see one more person who is interested in the
compression feature in Ignite.

I looked through the pull request and want to share the following thoughts:

Using a custom algorithm in this way is very dangerous: serialized data
is stored separately from the dictionary, and there are many points at
which we may lose data: rebalancing, serialization errors, node
restarts and so on.

I'd suggest the following ways to improve reliability:
- use well-known algorithms (e.g. zstd, deflate, lzma, gzip), which
allow us to decompress data in any situation
- store the dictionary inside the page with the data

Also, we have had a lot of discussions [1] [2] about compression at the
BinaryObject and BinaryMarshaller level, and Vladimir Ozerov was
strictly against compression at this level.
If something has changed since then, you may look through [1] [2] [3].
I've done a lot of research comparing algorithms; it may be useful
for you.

[1] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-2-0-td10099.html
[2] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-td20679.html
[3] https://issues.apache.org/jira/browse/IGNITE-3592
[4] https://issues.apache.org/jira/browse/IGNITE-5226
[5] https://github.com/daradurvs/ignite-compression
On Sat, Aug 25, 2018 at 2:51 AM Denis Magda <[hidden email]> wrote:

--
Best Regards, Vyacheslav D.

Ilya Kasnacheev

Re: Compression prototype

Hello Vyacheslav!

Unfortunately, I have not found any efficient algorithm that would let me
use an external dictionary as a pre-processed data structure. If plain gzip
is used without a dictionary, the compression ratio is around 0.7, as
opposed to the 0.4 that I get with the custom implementation; AFAIR the
performance was also worse. I didn't actually try gzip with a dictionary,
but I assume performance would be even worse, since it would have to scan
the dictionary before getting to the actual data.
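
For anyone who wants to try the "with a dictionary" case, the JDK's `java.util.zip.Deflater` (the same DEFLATE algorithm that gzip uses) accepts a preset dictionary through `setDictionary`. The sketch below only compares compressed sizes for one small record; the sample record, the dictionary contents and the class name are invented for illustration, and it says nothing about throughput, which is the concern raised above.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/** Sketch: compressed size of one small record with and without a preset dictionary. */
final class PresetDictionaryRatio {
    /** Returns the zlib-compressed size of data; dictOrNull is an optional preset dictionary. */
    static int compressedSize(byte[] data, byte[] dictOrNull) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dictOrNull != null)
                deflater.setDictionary(dictOrNull);
            deflater.setInput(data);
            deflater.finish();
            byte[] buf = new byte[256];
            int total = 0;
            while (!deflater.finished())
                total += deflater.deflate(buf);
            return total;
        } finally {
            deflater.end();
        }
    }

    /** Compressed/original ratio, in the same sense as the 0.4 and 0.7 figures above. */
    static double ratio(byte[] data, byte[] dictOrNull) {
        return (double) compressedSize(data, dictOrNull) / data.length;
    }

    public static void main(String[] args) {
        // Made-up sample data: the dictionary holds records shaped like the one being stored.
        byte[] dict = ("{\"id\":1001,\"firstName\":\"Bob\",\"lastName\":\"Ray\",\"city\":\"Rome\",\"active\":true}"
            + "{\"id\":1002,\"firstName\":\"Eve\",\"lastName\":\"Fox\",\"city\":\"Oslo\",\"active\":false}")
            .getBytes(StandardCharsets.UTF_8);
        byte[] record = "{\"id\":1042,\"firstName\":\"Ann\",\"lastName\":\"Lee\",\"city\":\"Oslo\",\"active\":true}"
            .getBytes(StandardCharsets.UTF_8);

        System.out.printf("ratio without dictionary: %.2f%n", ratio(record, null));
        System.out.printf("ratio with dictionary:    %.2f%n", ratio(record, dict));
    }
}
```

A real comparison would need a representative dataset and a throughput benchmark, since the dictionary is re-processed for every compression and decompression call.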

We have such a huge array of tests that we can just run them all with
compression enabled and see if there are any new failures. But the impact of
my commit is fairly low: it is only triggered when data is written to a page
(maybe to the WAL as well?), and we don't really do much frivolous stuff to
pages.

Still, I am very much interested in finding existing compression
implementations that support an external dictionary. I am also very much
interested in having different implementations of compression for Apache
Ignite (such as per-page compression) and comparing them by benchmark and
by code impact, and in large standard datasets for Apache Ignite (or
generators thereof) so that we can run precise benchmarks on various
compression schemes. If you have any of the above, please get back to me.

Regards,
--
Ilya Kasnacheev

On Mon, 27 Aug 2018 at 11:35, Vyacheslav Daradur <[hidden email]> wrote:

> > > one.
> > > > >
> > > > > I'm not sure yet if we should share dictionary between all caches
> vs.
> > > > > having separate dictionary for every cache.
> > > > >
> > > > > With regards to data format, of course there will be room for
> further
> > > > > extension.
> > > > >
> > > > > Regards,
> > > > >
> > > > > --
> > > > > Ilya Kasnacheev
> > > > >
> > > > > 2018-08-23 15:13 GMT+03:00 Sergey Kozlov <[hidden email]>:
> > > > >
> > > > > > Hi Ilya
> > > > > >
> > > > > > Is there a plan to introduce it as an option of Ignite
> configuration?
> > > > In
> > > > > > that instead the boolean type I suggest to use the enum and
> reserve
> > > the
> > > > > > ability to extend compressions algorithms in future
> > > > > >
> > > > > > On Thu, Aug 23, 2018 at 1:09 PM, Ilya Kasnacheev <
> > > > > > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > I want to share with the developer community my compression
> > > > prototype.
> > > > > > >
> > > > > > > Long story short, it compresses BinaryObject's byte[] as they
> are
> > > > > written
> > > > > > > to Durable Memory page, operating on a pre-built dictionary.
> > > Typical
> > > > > > > compression ratio is 0.4 (meaning 2.5x compression) using
> custom
> > > > > > > LZW+Huffman. Metadata, indexes and primitive values are
> unaffected
> > > > > > > entirely.
> > > > > > >
> > > > > > > This is akin to DB2's table-level compression[1] but
> independently
> > > > > > > invented.
> > > > > > >
> > > > > > > On Yardstick tests performance hit is -6% with PDS and up to
> -25%
> > > (in
> > > > > > > throughput) with In-Memory loads. It also means you can fit
> ~twice
> > > as
> > > > > > much
> > > > > > > data into the same IM cluster, or have higher ram/disk ratio
> with
> > > PDS
> > > > > > > cluster, saving on hardware or decreasing latency.
> > > > > > >
> > > > > > > The code is available as PR 4295[2] (set
> > > > IGNITE_ENABLE_COMPRESSION=true
> > > > > > to
> > > > > > > activate). Note that it will not presently survive a PDS node
> > > > restart.
> > > > > > > The impact is very small, the patch should be applicable to
> most
> > > 2.x
> > > > > > > releases.
> > > > > > >
> > > > > > > Sure there's a long way before this prototype can have hope of
> > > being
> > > > > > > included, but first I would like to hear input from fellow
> > > igniters.
> > > > > > >
> > > > > > > See also IEP-20[3].
> > > > > > >
> > > > > > > 1.
> > > > > > > https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.
> > > > > > > 5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html
> > > > > > > 2. https://github.com/apache/ignite/pull/4295
> > > > > > > 3.
> > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > > > > 20%3A+Data+Compression+in+Ignite
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > --
> > > > > > > Ilya Kasnacheev
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sergey Kozlov
> > > > > > GridGain Systems
> > > > > > www.gridgain.com
> > > > > >
> > > > >
> > > >
> > >
>
>
>
> --
> Best Regards, Vyacheslav D.
>

daradurvs

Re: Compression prototype

According to my benchmarks, the zstd compression algorithm [1] looks very
interesting: it has a high compression ratio with quite good speed.
AFAIK it supports external dictionaries, but I'm not sure about using
it with dictionaries built on the fly. Anyway, have a look at it (it
has an ASF 2.0 friendly license).

Also, here is a data generator / loader [2]. If it is useful for
you, we should ask Nikolay Izhikov to share public docs to get started.

[1] https://github.com/facebook/zstd
[2] https://github.com/nizhikov/ignite-cod-data-loader

--
Best Regards, Vyacheslav D.

daradurvs

Re: Compression prototype

I have another suggestion which may help us reduce object size
considerably: implementing some kind of SQL schema.

For now, BinaryObject's format is excessive: each serialized
object stores the offset of every serialized field, even if the offset
can be easily calculated.

If we move this metadata from the serialized object to a separate
entity, this will reduce the object's size.
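
A minimal sketch of that idea, assuming a hypothetical shared schema that
lists fixed-width fields in declaration order (this is not the actual
BinaryObject layout, and the class and field names are invented). The
offsets are derived once from the schema, so the serialized object carries
only the values. Variable-length fields such as strings would still need a
stored length or offset.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaOffsetsSketch {
    /**
     * Derives field offsets from a schema given as field name -> width in bytes.
     * The map must preserve declaration order, hence LinkedHashMap.
     */
    static Map<String, Integer> offsets(LinkedHashMap<String, Integer> widths) {
        Map<String, Integer> res = new LinkedHashMap<>();
        int off = 0;
        for (Map.Entry<String, Integer> e : widths.entrySet()) {
            res.put(e.getKey(), off); // a field starts where the previous one ended
            off += e.getValue();
        }
        return res;
    }

    public static void main(String[] args) {
        // Hypothetical schema, kept once per type instead of once per object.
        LinkedHashMap<String, Integer> widths = new LinkedHashMap<>();
        widths.put("id", Integer.BYTES);
        widths.put("balance", Long.BYTES);
        widths.put("age", Short.BYTES);
        Map<String, Integer> offs = offsets(widths);

        // The serialized object holds values only: 4 + 8 + 2 bytes, no offset table.
        ByteBuffer obj = ByteBuffer.allocate(Integer.BYTES + Long.BYTES + Short.BYTES);
        obj.putInt(42).putLong(1_000L).putShort((short) 30);

        // Random access to a field using the offset computed from the schema.
        long balance = obj.getLong(offs.get("balance"));
        System.out.println("offsets=" + offs + " balance=" + balance);
    }
}
```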

--
Best Regards, Vyacheslav D.

Ilya Kasnacheev

Re: Compression prototype

Hello!

Yes, we can tinker with the BinaryObject format, which is currently clearly
excessive.

But the best part of compression is that it will automatically remove this
redundancy for us, for free. Even if we had hairy XML as the binary object
format, it would still compress to roughly the same number of bytes. If we
have fast transparent compression, we can just skip this work. Of
course, codifying offsets can have other uses, but it also has a lot of
limitations.
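
A rough way to check this claim on synthetic data is to encode the same
records in a verbose XML-like form and in a compact form, then compare the
sizes before and after compression. The sketch uses the JDK's Deflater as a
stand-in compressor rather than the prototype's codec, and the record layout
and names are invented, so the numbers it prints say nothing about
BinaryObject itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class RedundancySketch {
    /** n synthetic records in a deliberately verbose, tag-heavy encoding. */
    static byte[] verbose(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++)
            sb.append("<person><name>user").append(i).append("</name><age>")
              .append(20 + i % 50).append("</age></person>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** The same n records with values only. */
    static byte[] compact(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++)
            sb.append("user").append(i).append(',').append(20 + i % 50).append('\n');
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns the number of bytes DEFLATE produces for the input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[4096];
            int total = 0;
            while (!deflater.finished())
                total += deflater.deflate(buf); // output is discarded, only counted
            return total;
        } finally {
            deflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] v = verbose(200);
        byte[] c = compact(200);
        System.out.println("raw:      verbose=" + v.length + " compact=" + c.length);
        System.out.println("deflated: verbose=" + deflatedSize(v) + " compact=" + deflatedSize(c));
    }
}
```

The gap between the two encodings should shrink considerably after
compression, although how close they end up depends on the data and on the
compressor.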

Regards,
--
Ilya Kasnacheev

> > > > > > > > > > 3.
> > > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> > > > > > > > > > 20%3A+Data+Compression+in+Ignite
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Ilya Kasnacheev
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Sergey Kozlov
> > > > > > > > > GridGain Systems
> > > > > > > > > www.gridgain.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards, Vyacheslav D.
> > > >
> >
> >
> >
> > --
> > Best Regards, Vyacheslav D.
>
>
>
> --
> Best Regards, Vyacheslav D.
>

Ilya Kasnacheev

Re: Compression prototype

In reply to this post by daradurvs

Hello!

I am testing Zstd with a dictionary, and it looks very promising. I'm
confident I can choose settings where it is faster than my own algorithm
while giving a better compression ratio on the "cod" dataset.

So I am happily retiring my code and switching to Zstd. This would probably
mean that we ship the compression implementation as a separate module.

It is a pity that I did not find out about Zstd dictionary support earlier;
that would have saved me a few days of work.

Without a dictionary, Zstd's compression ratio was worse than my own
algorithm's, but it was faster.
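
[Editor's sketch] The effect of a shared dictionary on small records can be
tried out with nothing but the Python standard library. The code below is a
minimal illustration under stated assumptions, not the code from the pull
request: it uses zlib's preset-dictionary support (zdict) as a stand-in for
Zstd dictionaries, the records are synthetic, and build_dictionary is a
hypothetical helper that simply keeps the tail of the concatenated samples
instead of running a real dictionary trainer.

```python
import zlib


def make_records(n):
    """Build n small JSON-like records that share field names (synthetic data)."""
    return [
        ('{"id": %d, "status": "ACTIVE", "country": "RU", "balance": %d}'
         % (i, (i * 37) % 1000)).encode("utf-8")
        for i in range(n)
    ]


def build_dictionary(samples, max_bytes=4096):
    """Naive stand-in for dictionary training: keep the tail of the joined samples."""
    return b"".join(samples)[-max_bytes:]


def compress(record, zdict=None, level=6):
    """Compress one record on its own, optionally against a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=level, zdict=zdict)
    else:
        comp = zlib.compressobj(level=level)
    return comp.compress(record) + comp.flush()


def decompress(blob, zdict=None):
    """Inverse of compress(); the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


def ratio(records, zdict=None):
    """Compressed size as a fraction of the original size (lower is better)."""
    raw = sum(len(r) for r in records)
    packed = sum(len(compress(r, zdict)) for r in records)
    return packed / raw


records = make_records(2000)
train, test = records[:1000], records[1000:]
dictionary = build_dictionary(train)

ratio_plain = ratio(test)
ratio_dict = ratio(test, dictionary)
print(f"without dictionary: {ratio_plain:.2f}")
print(f"with dictionary:    {ratio_dict:.2f}")
```

Each record is compressed independently, as in the per-object scheme
discussed in this thread, so without a dictionary the compressor has almost
no history to refer back to. The absolute numbers depend entirely on the
synthetic data and say nothing about the "cod" dataset.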

Regards,
--
Ilya Kasnacheev

On Mon, Aug 27, 2018 at 14:53, Vyacheslav Daradur <[hidden email]> wrote:

> According to my benchmarks, the zstd compression algorithm [1] looks very
> interesting: it has a high compression ratio with quite good speed.
> AFAIK it supports external dictionaries, but I'm not sure about using
> it with dictionaries built on the fly. Anyway, have a look at it (it
> has an ASF 2.0 friendly license).
>
> Also, here is a data generator / loader [2]. If it is useful for
> you, we should ask Nikolay Izhikov to share public docs to get started.
>
> [1] https://github.com/facebook/zstd
> [2] https://github.com/nizhikov/ignite-cod-data-loader

Ilya Kasnacheev

Re: Compression prototype

Just as I started praising Zstd, it began to crash the JVM in native code
during dictionary training :(

I guess there is a limit on the training buffer size, beyond which it
behaves erroneously. Maybe we will need to submit a pull request :)

Regards,
--
Ilya Kasnacheev


Ilya Kasnacheev

Re: Compression prototype

Hello again!

I've been running various compression parameters against the cod dataset.

It looks like the best compression level in terms of speed is either 1 or 2.
The default for Zstd seems to be 3, which would almost always perform worse.
For best performance a dictionary of 1024 bytes is optimal; for better
compression one might choose larger dictionaries. 6k looks good, but I will
also run a few benchmarks on larger dicts. Unfortunately, Zstd crashes if the
sample size is set to more than 16k entries (I guess I should probe the max
buffer size where problems begin).

I'm attaching two charts which show what we've got. Compression rate is the
compressed size as a fraction of the original record size. Time to run is
the wall clock time of the test run. Reasonable compression roughly doubles
the run time (of a program that only parses text records -> creates objects
-> serializes them to binary -> compresses -> decompresses). Notation:
s{number of binary objects used to train}-d{dictionary length in
bytes}-l{compression level}.
<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart1.png>
The second one is basically a zoom in on the first.
<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart2.png>
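
[Editor's sketch] The s{...}-d{...}-l{...} labels used on the charts are
easy to parse and replay as a parameter sweep. The code below is only a
sketch of that kind of benchmark loop under stated assumptions: it uses
zlib with a preset dictionary instead of Zstd (zlib levels are not
comparable to Zstd levels), the records are synthetic, and Params,
parse_label and run are hypothetical names that do not come from the pull
request.

```python
import re
import time
import zlib
from typing import NamedTuple


class Params(NamedTuple):
    samples: int     # number of records used to build the dictionary
    dict_bytes: int  # dictionary length in bytes
    level: int       # compression level


LABEL_RE = re.compile(r"s(\d+)-d(\d+)-l(\d+)")


def parse_label(label):
    """Turn a label such as 's1000-d6144-l1' into a Params tuple."""
    match = LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"not an s-d-l label: {label!r}")
    return Params(*(int(group) for group in match.groups()))


def make_records(n):
    """Build n small JSON-like records that share field names (synthetic data)."""
    return [
        ('{"id": %d, "status": "ACTIVE", "country": "RU", "balance": %d}'
         % (i, (i * 37) % 1000)).encode("utf-8")
        for i in range(n)
    ]


def run(label, train_pool, test):
    """Return (compression rate, wall clock seconds) for one configuration."""
    p = parse_label(label)
    if p.dict_bytes <= 0:
        raise ValueError("dictionary length must be positive")
    # Naive dictionary: the last dict_bytes bytes of the first `samples` records.
    zdict = b"".join(train_pool[:p.samples])[-p.dict_bytes:]
    started = time.perf_counter()
    raw = packed = 0
    for record in test:
        comp = zlib.compressobj(level=p.level, zdict=zdict)
        blob = comp.compress(record) + comp.flush()
        decomp = zlib.decompressobj(zdict=zdict)
        if decomp.decompress(blob) + decomp.flush() != record:
            raise RuntimeError("round trip failed")
        raw += len(record)
        packed += len(blob)
    return packed / raw, time.perf_counter() - started


pool = make_records(3000)
train_pool, test = pool[:2000], pool[2000:]
results = {}
for label in ("s100-d1024-l1", "s1000-d6144-l1", "s1000-d6144-l6"):
    results[label] = run(label, train_pool, test)
    rate, seconds = results[label]
    print(f"{label}: rate={rate:.2f} time={seconds:.3f}s")
```

The compression rate here has the same meaning as on the charts (compressed
size divided by original size), and the timing covers compression plus
decompression only, so it is not comparable with the full pipeline timing
described above.
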
I think that in addition to dictionary compression we should have
dictionary-less compression. On typical data of small records it shows a
compression rate of 0.8 to 0.65, but I can imagine that with larger
unstructured records it can be as good as dict-based and much less of a
hassle dictionary-processing-wise. WDYT?

Sorry for the fine print. I hope my charts will be visible.

You can see the updated code as pull request:
https://github.com/apache/ignite/pull/4673

Regards,

--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/

dsetrakyan

Re: Compression prototype

Hi Ilya,

This is very useful. Is the compression going to be per-page, in which case
the dictionary is going to be kept inside of a page? Or do you have some
other design in mind?

D.


Ilya Kasnacheev

Re: Compression prototype

Hello!

The compression is per-binary-object, but the dictionary is external, shared
between multiple (millions of) entries and stored alongside the compressed
data.
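
[Editor's sketch] One way to picture "per-object compression with an
external shared dictionary" is a record format that carries the identifier
of the dictionary it was compressed with. The code below is purely
illustrative and assumes a layout that is not taken from the pull request:
DictionaryStore, MissingDictionaryError, the 4-byte CRC32 identifier and the
use of zlib instead of Zstd are all hypothetical choices.

```python
import struct
import zlib


class MissingDictionaryError(LookupError):
    """Raised when a record refers to a dictionary this node does not hold."""


class DictionaryStore:
    """Hypothetical per-node registry of dictionaries, keyed by their CRC32."""

    def __init__(self):
        self._by_id = {}

    def register(self, dictionary):
        dict_id = zlib.crc32(dictionary)
        self._by_id[dict_id] = dictionary
        return dict_id

    def get(self, dict_id):
        try:
            return self._by_id[dict_id]
        except KeyError:
            raise MissingDictionaryError(
                f"dictionary {dict_id:#010x} not found") from None


HEADER = struct.Struct(">I")  # 4-byte big-endian dictionary id


def encode(record, dict_id, store):
    """Compress one record and prefix it with the id of the dictionary used."""
    comp = zlib.compressobj(zdict=store.get(dict_id))
    return HEADER.pack(dict_id) + comp.compress(record) + comp.flush()


def decode(blob, store):
    """Read the dictionary id, look the dictionary up, then decompress."""
    (dict_id,) = HEADER.unpack_from(blob)
    decomp = zlib.decompressobj(zdict=store.get(dict_id))
    return decomp.decompress(blob[HEADER.size:]) + decomp.flush()


node_a = DictionaryStore()
dict_id = node_a.register(b'{"id": 0, "status": "ACTIVE", "country": "RU"}')
record = b'{"id": 42, "status": "ACTIVE", "country": "RU"}'
blob = encode(record, dict_id, node_a)
print(decode(blob, node_a) == record)

node_b = DictionaryStore()  # a node that never received the dictionary
try:
    decode(blob, node_b)
except MissingDictionaryError as err:
    print("cannot decompress:", err)
```

With an identifier in front of every compressed record, a node that lacks
the dictionary (for example after a restart, if the dictionary lived only
on heap) fails loudly instead of returning garbage, and several dictionary
versions can coexist. The price is a few extra bytes per record.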

Regards,
--
Ilya Kasnacheev


dsetrakyan

Re: Compression prototype

On Tue, Sep 4, 2018 at 1:16 AM, Ilya Kasnacheev <[hidden email]>
wrote:

> Hello!
>
> The compression is per-binary-object, but dictionary is external, shared
> between multiple (millions of) entries and stored alongside compressed
> data.
>

I was under a different impression. If the dictionary is for the whole data
set, then it will occupy megabytes (if not gigabytes) of data. What happens
when a new node joins and has no idea about the dictionary? What happens
when the dictionaries on different nodes get out of sync?

D.