Hello!
I want to share my compression prototype with the developer community.

Long story short, it compresses a BinaryObject's byte[] as it is written to a Durable Memory page, operating on a pre-built dictionary. The typical compression ratio is 0.4 (meaning 2.5x compression) using a custom LZW+Huffman scheme. Metadata, indexes and primitive values are entirely unaffected.

This is akin to DB2's table-level compression [1], but independently invented.

On Yardstick tests the performance hit is 6% with PDS and up to 25% (in throughput) with in-memory loads. It also means you can fit roughly twice as much data into the same in-memory cluster, or have a higher RAM/disk ratio with a PDS cluster, saving on hardware or decreasing latency.

The code is available as PR 4295 [2] (set IGNITE_ENABLE_COMPRESSION=true to activate). Note that it will not presently survive a PDS node restart. The code impact is very small, so the patch should be applicable to most 2.x releases.

Sure, there is a long way to go before this prototype can hope to be included, but first I would like to hear input from fellow Igniters.

See also IEP-20 [3].

[1] https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html
[2] https://github.com/apache/ignite/pull/4295
[3] https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite

Regards,
--
Ilya Kasnacheev
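One way to read the gap between the 2.5x ratio on object bytes and the "roughly twice as much data" estimate is that only part of each page is compressible. The sketch below is a back-of-the-envelope model and an assumption, not something taken from PR 4295: `compressibleFraction` is a hypothetical parameter for the share of memory occupied by BinaryObject bytes, and everything else (metadata, indexes, primitives) is assumed to stay at its original size.

```java
/** Back-of-the-envelope capacity model; an illustrative assumption, not code from the patch. */
final class CapacityEstimate {
    /**
     * @param compressibleFraction share of memory holding compressible object bytes (0..1), hypothetical
     * @param ratio compressed size divided by original size, e.g. 0.4
     * @return how many times more data fits into the same memory
     */
    static double capacityMultiplier(double compressibleFraction, double ratio) {
        // Compressible part shrinks by `ratio`; the rest is stored as before.
        double remainingSize = compressibleFraction * ratio + (1.0 - compressibleFraction);
        return 1.0 / remainingSize;
    }
}
```

Under this model, a ratio of 0.4 gives the full 2.5x only when everything is compressible, and a multiplier near 2x corresponds to roughly five sixths of memory being object bytes.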
Hi Ilya,
Thank you for sharing this here. I believe this contribution will be accepted by the community. Moreover, it shows such a remarkable performance boost.

I'm pretty sure this patch will be reviewed by Ignite Native Persistence experts soon.

What do you mean by "can't survive a PDS node restart"?

Sincerely,
Dmitriy Pavlov

On Thu, 23 Aug 2018 at 13:09, Ilya Kasnacheev <[hidden email]> wrote:
> [...]
> The code is available as PR 4295[2] (set IGNITE_ENABLE_COMPRESSION=true to
> activate). Note that it will not presently survive a PDS node restart.
> [...]
Hello!
Currently, the dictionary for decompression is only stored on heap. After a restart there is compressed data in the PDS, but no dictionary :)

Regards,
--
Ilya Kasnacheev

2018-08-23 14:58 GMT+03:00 Dmitriy Pavlov <[hidden email]>:
> What do you mean by can't survive PDS node restart?
Hi Ilya
Is there a plan to introduce it as an option in the Ignite configuration? In that case, instead of a boolean I suggest using an enum, to reserve the ability to add more compression algorithms in the future.

On Thu, Aug 23, 2018 at 1:09 PM, Ilya Kasnacheev <[hidden email]> wrote:
> [...]
> The code is available as PR 4295[2] (set IGNITE_ENABLE_COMPRESSION=true to
> activate).
> [...]

--
Sergey Kozlov
GridGain Systems
www.gridgain.com
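A minimal sketch of what an enum-based switch could look like, assuming the current boolean flag maps onto two of its values. The name `CompressionAlgorithm` and its constants are hypothetical; they are not part of Ignite or of PR 4295.

```java
/** Hypothetical replacement for a boolean "compression enabled" flag; not an Ignite API. */
enum CompressionAlgorithm {
    /** Values are stored uncompressed. */
    DISABLED,
    /** The dictionary-based LZW+Huffman scheme described in this thread. */
    LZW_HUFFMAN;

    /** Maps the existing boolean switch onto the enum, so old settings keep working. */
    static CompressionAlgorithm fromLegacyFlag(boolean enabled) {
        return enabled ? LZW_HUFFMAN : DISABLED;
    }

    /** Convenience check for call sites that only care whether compression is on. */
    boolean isEnabled() {
        return this != DISABLED;
    }
}
```

New algorithms would then be added as further constants without changing the type of the configuration property.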
OK, thanks. IMO we need to store the dictionary in Durable Memory before merging into master.

On Thu, 23 Aug 2018 at 15:12, Ilya Kasnacheev <[hidden email]> wrote:
> Currently, the dictionary for decompression is only stored on heap. After
> restart there's compressed data in the PDS, but there's no dictionary :)
Hello!
My plan was to add a compression section to the cache configuration, where you can enable compression, enable key compression (which has heavier performance implications), adjust dictionary gathering settings, and in the future possibly choose between algorithms. In fact I'm not sure about that last point, since my assumption is that you can always just use the latest and greatest, but maybe we can have e.g. a very fast but weaker algorithm vs. a slower but stronger one.

I'm not sure yet whether we should share one dictionary between all caches or have a separate dictionary for every cache.

With regard to the data format, of course there will be room for further extension.

Regards,
--
Ilya Kasnacheev

2018-08-23 15:13 GMT+03:00 Sergey Kozlov <[hidden email]>:
> Is there a plan to introduce it as an option of Ignite configuration? In
> that instead the boolean type I suggest to use the enum and reserve the
> ability to extend compressions algorithms in future
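The configuration section described above could be sketched roughly as follows. Every name here (`CacheCompressionSettings`, `dictionarySampleSize`, the `Algorithm` constants) is hypothetical and only illustrates the listed knobs; none of it exists in Ignite or in the pull request.

```java
/** Hypothetical per-cache compression settings; an illustration, not an Ignite API. */
final class CacheCompressionSettings {
    /** Placeholder for a possible future choice between a faster and a stronger algorithm. */
    enum Algorithm { FAST, STRONG }

    private boolean enabled;
    private boolean compressKeys;
    private int dictionarySampleSize = 1_000; // assumed "dictionary gathering" knob
    private Algorithm algorithm = Algorithm.FAST;

    CacheCompressionSettings setEnabled(boolean enabled) { this.enabled = enabled; return this; }
    CacheCompressionSettings setCompressKeys(boolean compressKeys) { this.compressKeys = compressKeys; return this; }
    CacheCompressionSettings setDictionarySampleSize(int size) { this.dictionarySampleSize = size; return this; }
    CacheCompressionSettings setAlgorithm(Algorithm algorithm) { this.algorithm = algorithm; return this; }

    boolean isEnabled() { return enabled; }
    boolean isCompressKeys() { return compressKeys; }
    int getDictionarySampleSize() { return dictionarySampleSize; }
    Algorithm getAlgorithm() { return algorithm; }

    /** Rejects combinations that make no sense, e.g. key compression without compression. */
    void validate() {
        if (compressKeys && !enabled)
            throw new IllegalArgumentException("Key compression requires compression to be enabled");
        if (dictionarySampleSize <= 0)
            throw new IllegalArgumentException("Dictionary sample size must be positive");
        if (algorithm == null)
            throw new IllegalArgumentException("Algorithm must be set");
    }
}
```

Whether such an object would hang off each cache configuration or be shared depends on the open question of one dictionary per cache versus one for all caches.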
Hi Ilya,
Sounds terrific! Is this part of the following Ignite enhancement proposal?
https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite

--
Denis

On Thu, Aug 23, 2018 at 5:17 AM Ilya Kasnacheev <[hidden email]> wrote:
> My plan was to add a compression section to cache configuration [...]
Hello!
It is somewhat a part of IEP-20, since I have updated it with this particular direction.

Regards,
--
Ilya Kasnacheev

2018-08-24 2:56 GMT+03:00 Denis Magda <[hidden email]>:
> Sounds terrific! Is this part of the following Ignite enhancement proposal?
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-20%3A+Data+Compression+in+Ignite
> Currently, the dictionary for decompression is only stored on heap. After
> restart there's compressed data in the PDS, but there's no dictionary :)

Basically, it means that I've lost my data, right? How about persisting data to disk?

Overall, we need Vladimir Ozerov to check the contribution. He was the one who sponsored the IEP and knows the area best.

--
Denis

On Fri, Aug 24, 2018 at 4:31 AM Ilya Kasnacheev <[hidden email]> wrote:
> It is somewhat a part of IEP-20, since I have updated it with this
> particular direction.
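As a rough illustration of the concern raised here, a dictionary that lives only on heap could in principle be written next to the persisted data and reloaded at startup. The sketch below uses plain files and is an assumption for illustration only: `DictionaryFileStore` is a hypothetical class, and a real solution would more likely keep the dictionary in Durable Memory, as suggested earlier in the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical file-backed dictionary store; not how PR 4295 or Ignite handles dictionaries. */
final class DictionaryFileStore {
    private final Path file;

    DictionaryFileStore(Path file) {
        this.file = file;
    }

    /** Writes to a temporary sibling first, then renames it over the target file. */
    void save(byte[] dictionary) {
        try {
            Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
            Files.write(tmp, dictionary);
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the stored dictionary, or null if nothing was saved before the restart. */
    byte[] loadOrNull() {
        try {
            return Files.exists(file) ? Files.readAllBytes(file) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Small self-check: saves into a fresh temp directory and reads the bytes back. */
    static byte[] roundTripInTempDir(byte[] dictionary) {
        try {
            Path dir = Files.createTempDirectory("dict-store");
            DictionaryFileStore store = new DictionaryFileStore(dir.resolve("dictionary.bin"));
            store.save(dictionary);
            return store.loadOrNull();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch ignores fsync, versioning and multi-node consistency, all of which a production design would have to address.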
Hi Igniters!
Ilya, I'm glad to see one more person who is interested in the compression feature in Ignite.

I looked through the pull request and want to share the following thoughts:

It is very dangerous to use a custom algorithm in this way: you store serialized data separately from the dictionary, and there are many points at which we may lose data: rebalancing, serialization errors, node restarts and so on.

I'd suggest the following ways to improve reliability:
- use well-known algorithms, e.g. zstd, deflate, lzma or gzip, which allow us to decompress data in any situation
- store the dictionary inside the page with the data

Also, we have had a lot of discussions [1] [2] about compression at the BinaryObject and BinaryMarshaller level, and Vladimir Ozerov was strictly against compression at this level.
If something has changed since then, you may look through [1] [2] [3]. I've done a lot of research comparing algorithms; it may be useful for you.

[1] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-2-0-td10099.html
[2] http://apache-ignite-developers.2346864.n4.nabble.com/Data-compression-in-Ignite-td20679.html
[3] https://issues.apache.org/jira/browse/IGNITE-3592
[4] https://issues.apache.org/jira/browse/IGNITE-5226
[5] https://github.com/daradurvs/ignite-compression

On Sat, Aug 25, 2018 at 2:51 AM Denis Magda <[hidden email]> wrote:
> Basically, it means that I've lost my data, right? How about persisting
> data to disk.
>
> Overall, we need Vladimir Ozerov to check the contribution. He was the one
> who sponsored the IEP and knows the area best.

--
Best Regards, Vyacheslav D.
Hello Vyacheslav!
Unfortunately, I have not found any efficient algorithm that lets me use an external dictionary as a pre-processed data structure. Plain gzip without a dictionary gives a compression ratio of around 0.7, as opposed to the 0.4 that I get with the custom implementation; AFAIR its performance was also worse. I did not actually try gzip with a dictionary, but I assume performance would be even worse, since it would have to scan the dictionary before getting to the actual data.

We have such a huge array of tests that we can simply run them all with compression enabled and see if there are any new failures. But the impact of my commit is fairly low: it is only triggered when data is written to a page (maybe to WAL also?), and we don't really do much frivolous stuff to pages.

Still, I am very much interested in finding existing compression implementations that support an external dictionary. I am also very much interested in having different implementations of compression for Apache Ignite (such as per-page compression) and comparing them by benchmark and by code impact, and in large standard datasets for Apache Ignite (or generators thereof), so that we can run precise benchmarks on various compression schemes. If you have any of the above, please get back to me.

Regards,
--
Ilya Kasnacheev
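For readers who want to try the "gzip with a dictionary" variant mentioned above, here is a minimal sketch using the JDK's own Deflate implementation with a preset dictionary. It is not the code from the pull request and not the custom LZW+Huffman scheme; the class name, the sample record and the dictionary contents are made up for illustration only.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: JDK Deflate with an optional preset dictionary.
final class PresetDictionaryDemo {
    /** Compresses input; dict may be null to compress without a dictionary. */
    static byte[] compress(byte[] input, byte[] dict) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            if (dict != null)
                deflater.setDictionary(dict); // must be set before any data is deflated
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[input.length + 64];
            int len = 0;
            while (!deflater.finished()) {
                if (len == buf.length)
                    buf = Arrays.copyOf(buf, buf.length * 2); // grow if the output did not fit
                len += deflater.deflate(buf, len, buf.length - len);
            }
            return Arrays.copyOf(buf, len);
        } finally {
            deflater.end();
        }
    }

    /** Decompresses data; the same dictionary must be supplied if one was used. */
    static byte[] decompress(byte[] compressed, byte[] dict, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int len = 0;
            while (len < out.length && !inflater.finished()) {
                int n = inflater.inflate(out, len, out.length - len);
                if (n == 0) {
                    // The stream header says a dictionary is needed: supply it and retry.
                    if (inflater.needsDictionary() && dict != null)
                        inflater.setDictionary(dict);
                    else
                        throw new IllegalArgumentException("Missing dictionary or truncated input");
                }
                len += n;
            }
            return Arrays.copyOf(out, len);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Corrupt compressed data", e);
        } finally {
            inflater.end();
        }
    }
}
```

Calling `compress(record, null)` and `compress(record, dict)` on the same small record and comparing the array lengths gives a quick feel for how much a shared dictionary helps on short values. Note that decompression fails without the dictionary, which is the same data-loss concern raised elsewhere in this thread.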
According to my benchmarks, the zstd compression algorithm [1] looks very interesting: it has a high compression ratio with quite good speed. AFAIK it supports external dictionaries, but I'm not sure about using it with dictionaries built on the fly. Anyway, have a look at it (it has an ASF 2.0 friendly license).

Also, here is a data generator / loader [2]. If it is useful for you, we should ask Nikolay Izhikov to share public docs to start with.

[1] https://github.com/facebook/zstd
[2] https://github.com/nizhikov/ignite-cod-data-loader

--
Best Regards, Vyacheslav D.
I have another suggestion that may help us reduce object size dramatically: implementing some kind of SQL schema.

For now, BinaryObject's format is too excessive: each serialized object stores the offset of every serialized field, even if the offset can be easily calculated.

If we move this metadata from the serialized object to a separate entity, this will reduce the object's size.

--
Best Regards, Vyacheslav D.
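To make the size argument concrete, here is a rough sketch comparing two toy layouts: one that appends a per-field offset table to every object, and one that keeps only the values and derives offsets from a shared schema. This is not the real BinaryObject wire format; the `FieldType` enum, the 4-byte offsets and both layouts are assumptions chosen purely for illustration.

```java
import java.nio.ByteBuffer;

// Toy layouts only; this does not reproduce Ignite's actual BinaryObject format.
final class OffsetTableDemo {
    /** Hypothetical fixed-width field types with their sizes in bytes. */
    enum FieldType {
        INT(4), LONG(8);

        final int size;

        FieldType(int size) { this.size = size; }
    }

    private static void put(ByteBuffer buf, FieldType type, long value) {
        if (type == FieldType.INT)
            buf.putInt((int) value);
        else
            buf.putLong(value);
    }

    private static long get(ByteBuffer buf, FieldType type, int offset) {
        return type == FieldType.INT ? buf.getInt(offset) : buf.getLong(offset);
    }

    private static int dataLength(FieldType[] schema) {
        int len = 0;
        for (FieldType type : schema)
            len += type.size;
        return len;
    }

    /** Layout A: field values followed by a footer holding one 4-byte offset per field. */
    static byte[] writeWithOffsets(FieldType[] schema, long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(dataLength(schema) + 4 * schema.length);
        int[] offsets = new int[schema.length];
        for (int i = 0; i < schema.length; i++) {
            offsets[i] = buf.position();
            put(buf, schema[i], values[i]);
        }
        for (int offset : offsets)
            buf.putInt(offset);
        return buf.array();
    }

    /** Layout B: field values only; offsets are derived from the shared schema. */
    static byte[] writeCompact(FieldType[] schema, long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(dataLength(schema));
        for (int i = 0; i < schema.length; i++)
            put(buf, schema[i], values[i]);
        return buf.array();
    }

    /** Reads a field from layout A by looking its offset up in the footer. */
    static long readWithOffsets(byte[] obj, FieldType[] schema, int idx) {
        ByteBuffer buf = ByteBuffer.wrap(obj);
        int footerStart = obj.length - 4 * schema.length;
        return get(buf, schema[idx], buf.getInt(footerStart + 4 * idx));
    }

    /** Reads a field from layout B by summing the sizes of the preceding fields. */
    static long readCompact(byte[] obj, FieldType[] schema, int idx) {
        int offset = 0;
        for (int i = 0; i < idx; i++)
            offset += schema[i].size;
        return get(ByteBuffer.wrap(obj), schema[idx], offset);
    }
}
```

For a schema of INT, LONG, INT the first layout takes 28 bytes and the second 16. The sketch only works because every field is fixed-width; variable-length fields such as strings or arrays would still need stored lengths or offsets, so the saving depends on the shape of the data.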
Hello!
Yes, we can tinker with the BinaryObject format, which is currently clearly excessive. But the best part of compression is that it will automatically remove this redundancy for us, for free. Even if we had hairy XML as the binary object format, it would still compress to roughly the same number of bytes.

If we have fast transparent compression, we can just skip this work. Of course, codifying offsets can have other uses, but it also has a lot of limitations.

Regards,
--
Ilya Kasnacheev
In reply to this post by daradurvs
Hello!
I am testing Zstd with a dictionary, and it looks very promising. I'm confident I can choose settings where it is faster than my own algorithm while giving a better compression ratio on the "cod" dataset.

So I am happily retiring my code and switching to Zstd. This would probably mean that we ship the compression implementation as a separate module.

It is a pity that I did not find out about Zstd dictionary support earlier; it would have saved me a few days of work.

Without a dictionary, Zstd's results were worse than my own algorithm's, but it was faster.

Regards,
--
Ilya Kasnacheev
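To illustrate why a dictionary matters so much for small records, here is a minimal sketch. It does not use Zstd or the code from the pull request: as a stand-in it relies on the preset-dictionary support of `java.util.zip.Deflater` from the JDK, and the class name, the sample dictionary and the sample record are all made up for the example. The sizes it prints only show the trend and say nothing about Zstd's actual ratios.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; Deflater stands in for Zstd here.
class PresetDictionaryDemo {
    /** Compresses one record; pass dict == null for dictionary-less compression. */
    static byte[] compress(byte[] record, byte[] dict) {
        Deflater deflater = new Deflater();
        try {
            if (dict != null)
                deflater.setDictionary(dict); // must be set before any input is deflated
            deflater.setInput(record);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished())
                out.write(buf, 0, deflater.deflate(buf));
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decompresses one record; the same dictionary used for compression is required. */
    static byte[] decompress(byte[] data, byte[] dict) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n > 0)
                    out.write(buf, 0, n);
                else if (inflater.needsDictionary() && dict != null)
                    inflater.setDictionary(dict); // the stream asks for its dictionary
                else if (!inflater.finished())
                    throw new IllegalStateException("Truncated data or missing dictionary");
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt compressed data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // The "dictionary" is just text that is typical for the records.
        byte[] dict = "{\"firstName\":\"\",\"lastName\":\"\",\"city\":\"\",\"country\":\"\"}"
            .getBytes(StandardCharsets.UTF_8);
        byte[] record = "{\"firstName\":\"Ann\",\"lastName\":\"Lee\",\"city\":\"Riga\",\"country\":\"Latvia\"}"
            .getBytes(StandardCharsets.UTF_8);

        System.out.println("original:           " + record.length + " bytes");
        System.out.println("without dictionary: " + compress(record, null).length + " bytes");
        System.out.println("with dictionary:    " + compress(record, dict).length + " bytes");
    }
}
```

A short record has almost no internal repetition, so the dictionary-less run has little to work with, while the dictionary run can replace the recurring field names with back-references.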
Just as I started praising Zstd, it began to show JVM crashes in native code during dictionary training :( I guess there is a limit on the training buffer size, beyond which it behaves erroneously. Maybe we will need to submit a pull request :)

Regards,
--
Ilya Kasnacheev
Hello again!
I've been running various compression parameters through the cod dataset.

It looks like the best compression level in terms of speed is either 1 or 2. The default for Zstd seems to be 3, which would almost always perform worse. For best performance a dictionary of 1024 bytes is optimal; for better compression one might choose larger dictionaries. 6k looks good, but I will also run a few benchmarks on larger dicts. Unfortunately, Zstd crashes if the sample size is set to more than 16k entries (I guess I should probe for the max buffer size where the problems begin).

I'm attaching two charts which show what we've got. Compression rate is the compressed size as a fraction of the original record size. Time to run is the wall-clock time of the test run. Reasonable compression will increase the run time twofold (for a program that only parses text records -> creates objects -> binarylizes them -> compresses -> decompresses). Notation: s{number of binary objects used to train}-d{dictionary length in bytes}-l{compression level}.

<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart1.png>

The second one is basically a zoom-in on the first.

<http://apache-ignite-developers.2346864.n4.nabble.com/file/t374/chart2.png>

I think that in addition to dictionary compression we should have dictionary-less compression. On typical data of small records it shows a compression rate of 0.8 to 0.65, but I can imagine that with larger unstructured records it can be as good as dict-based compression and much less of a hassle dictionary-processing-wise. WDYT?

Sorry for the fine print. I hope my charts will be visible.

You can see the updated code as a pull request: https://github.com/apache/ignite/pull/4673

Regards,
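For anyone who wants to post-process the chart labels, the notation and the compression rate can be read as in the small sketch below. The class and method names are hypothetical, and it assumes that the three numbers in a label are plain integers (for example `s16000-d1024-l2`); the labels on the actual charts may be abbreviated differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for labels of the form s{samples}-d{dictBytes}-l{level}.
class BenchmarkLabel {
    private static final Pattern LABEL = Pattern.compile("s(\\d+)-d(\\d+)-l(\\d+)");

    final int samples;    // number of binary objects used to train the dictionary
    final int dictBytes;  // dictionary length in bytes
    final int level;      // compression level

    BenchmarkLabel(int samples, int dictBytes, int level) {
        this.samples = samples;
        this.dictBytes = dictBytes;
        this.level = level;
    }

    static BenchmarkLabel parse(String label) {
        Matcher m = LABEL.matcher(label);
        if (!m.matches())
            throw new IllegalArgumentException("Unexpected label: " + label);
        return new BenchmarkLabel(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)));
    }

    /** Compression rate as used in this thread: compressed size / original size. */
    static double compressionRate(long compressedBytes, long originalBytes) {
        return (double) compressedBytes / originalBytes;
    }

    public static void main(String[] args) {
        BenchmarkLabel l = parse("s16000-d1024-l2");
        System.out.println(l.samples + " samples, " + l.dictBytes + " byte dictionary, level " + l.level);
        // A rate of 0.4 is the same thing as 2.5x compression.
        System.out.println("rate = " + compressionRate(400, 1000));
    }
}
```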
Hi Ilya,
This is very useful. Is the compression going to be per-page, in which case the dictionary is going to be kept inside of a page? Or do you have some other design in mind?

D.
Hello!
The compression is per-binary-object, but the dictionary is external, shared between multiple (millions of) entries and stored alongside the compressed data.

Regards,
--
Ilya Kasnacheev
On Tue, Sep 4, 2018 at 1:16 AM, Ilya Kasnacheev <[hidden email]> wrote:

> The compression is per-binary-object, but dictionary is external, shared
> between multiple (millions of) entries and stored alongside compressed
> data.

I was under a different impression. If the dictionary is for the whole data set, then it will occupy megabytes (if not gigabytes) of data. What happens when a new node joins and has no idea about the dictionary? What happens when the dictionaries on different nodes get out of sync?

D.
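One way an out-of-sync dictionary could at least be detected, rather than silently producing garbage, is to tag every compressed value with an identifier of the dictionary it was written with. The sketch below is purely illustrative: nothing in this thread says the pull request does this, the class and method names are hypothetical, and the CRC32 of the dictionary bytes is just one possible choice of identifier.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical framing: [4-byte dictionary id][compressed payload].
class DictionaryTaggedValue {
    /** Identifier of a dictionary; here simply the CRC32 of its bytes. */
    static int dictionaryId(byte[] dict) {
        CRC32 crc = new CRC32();
        crc.update(dict, 0, dict.length);
        return (int) crc.getValue();
    }

    /** Prepends the id of the dictionary that was used to compress the payload. */
    static byte[] tag(byte[] dict, byte[] compressed) {
        return ByteBuffer.allocate(4 + compressed.length)
            .putInt(dictionaryId(dict))
            .put(compressed)
            .array();
    }

    /** Returns the payload, or fails if the local dictionary is not the one the writer used. */
    static byte[] untag(byte[] localDict, byte[] tagged) {
        ByteBuffer buf = ByteBuffer.wrap(tagged);
        int writerDictId = buf.getInt();
        if (writerDictId != dictionaryId(localDict))
            throw new IllegalStateException("Value was compressed with a different dictionary");
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return payload;
    }
}
```

A node that hits the mismatch would then know it has to obtain the right dictionary before it can read the value; how dictionaries are distributed and persisted is a separate question that this sketch does not address.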