Custom string encoding

Valentin Kulichenko
Folks,

Currently the binary marshaller always encodes strings in UTF-8. However,
sometimes it can be useful to customize this. For example, if the data contains
a lot of Cyrillic, Chinese or other non-Latin symbols and few Latin ones,
memory is used very inefficiently. In this case it would be great to encode
the most frequently used symbols in one byte instead of two or three.

I propose introducing a BinaryStringEncoder interface that converts
strings to byte arrays and back, and making it pluggable via
BinaryConfiguration. This would allow users to plug in any encoding
algorithm based on their requirements.

Thoughts?

https://issues.apache.org/jira/browse/IGNITE-5655

-Val
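
For illustration only, the BinaryStringEncoder proposed above might look roughly like the sketch below. Only the interface name comes from the proposal. The method names encode/decode and the CharsetStringEncoder class are assumptions made for this example and are not part of Ignite. The sketch also assumes the JVM ships the windows-1251 charset (known in Java by the alias Cp1251).

```java
import java.nio.charset.Charset;

/** Hypothetical interface: the name is from the proposal, the method names are assumed. */
interface BinaryStringEncoder {
    /** Converts a string to its encoded byte representation. */
    byte[] encode(String str);

    /** Converts encoded bytes back to a string. */
    String decode(byte[] bytes);
}

/** One possible implementation that simply delegates to a JDK charset. */
final class CharsetStringEncoder implements BinaryStringEncoder {
    private final Charset charset;

    CharsetStringEncoder(String charsetName) {
        // Throws UnsupportedCharsetException if the JVM does not provide the charset.
        this.charset = Charset.forName(charsetName);
    }

    @Override
    public byte[] encode(String str) {
        return str.getBytes(charset);
    }

    @Override
    public String decode(byte[] bytes) {
        return new String(bytes, charset);
    }
}
```

With new CharsetStringEncoder("windows-1251"), a six-letter Cyrillic word encodes to six bytes, where UTF-8 would need twelve.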

Re: Custom string encoding

Andrew Mashenkov
Val,

This seems to make sense.

This will not affect the full-text index, as Lucene has its own format for storing
data.

But would it be compatible with H2 indexing? I doubt it.

Re: Custom string encoding

Valentin Kulichenko
Andrey,

Can you elaborate more on this? What is your concern?

-Val

Re: Custom string encoding

dsetrakyan
Val, do you know how we compare strings in SQL queries? Will we be able to
use this encoder?

Additionally, I think that the encoder is a bit too abstract. Why not go
even further and allow users to create their own ASCII table for encoding?

D.

Re: Custom string encoding

Sergi
In SQL indexes we may store partial strings and assume them to be in UTF-8;
I don't think this can be abstracted away. But maybe this is not a big
deal if we still use UTF-8 in indexes.

Sergi

Re: Custom string encoding

dsetrakyan
On Sat, Jul 1, 2017 at 2:24 AM, Sergi Vladykin <[hidden email]>
wrote:

> In SQL indexes we may store partial strings and assume them to be in UTF-8;
> I don't think this can be abstracted away. But maybe this is not a big
> deal if we still use UTF-8 in indexes.
>

Sergi, why does it matter whether it is UTF-8 or a custom encoding? Why can't we
use our own compact encoding in indexes?


Re: Custom string encoding

Vladimir Ozerov
Valya,

Personally, I vote against this feature. BinaryConfiguration has proven to be
inconvenient: it has to be configured before node start, it cannot be
changed at runtime, and it requires classes on the server. Moreover, if you
decide to change the encoding at some point, it would be impossible.

I think we should add this feature at the API level instead. If a string is
written in non-UTF-8 form, we will write it in a different format:
[encoding_code][string]

BinaryWriter.writeString(String fieldName, String val);
BinaryWriter.writeString(String fieldName, String val, String encoding);

BinaryReader.readString(String fieldName);
BinaryReader.readString(String fieldName, String encoding);

BinaryObjectBuilder.writeString(String fieldName, String val, String encoding);

class MyClass {
    @BinaryString(encoding = "Cp1251")
    private String myCyrillicString;
}

Vladimir.
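
As a rough sketch of the [encoding_code][string] layout described above, the following shows one way such a record could be written and read back. Everything here is an assumption made for illustration: the EncodedStringFormat class, the one-byte code, the four-byte length prefix and the code table are hypothetical and do not describe Ignite's actual binary format. The sketch also assumes the JVM provides windows-1251 (alias Cp1251).

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical [encoding_code][length][bytes] record; not Ignite's real wire format. */
final class EncodedStringFormat {
    // Illustrative code table: the array index is the encoding code stored in the record.
    private static final Charset[] CHARSETS = {
        StandardCharsets.UTF_8,          // code 0
        Charset.forName("windows-1251")  // code 1
    };

    /** Writes the encoding code, the payload length and the encoded payload. */
    static byte[] write(String val, int encodingCode) {
        byte[] payload = val.getBytes(CHARSETS[encodingCode]);
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + payload.length);
        buf.put((byte) encodingCode);
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    /** Reads the encoding code first, then decodes the payload with the matching charset. */
    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        Charset charset = CHARSETS[buf.get()];
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        return new String(payload, charset);
    }
}
```

Because the code travels with every string, a reader does not need any node-level configuration to decode the value, which is the property the API-level approach is after.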

Re: Custom string encoding

dsetrakyan
Vladimir, how would you plug in custom encoders in your design?

Re: Custom string encoding

Vladimir Ozerov
There is no need for custom encoders, as they are already built into Java.

Re: Custom string encoding

dsetrakyan
On Sun, Jul 2, 2017 at 9:50 AM, Vladimir Ozerov <[hidden email]>
wrote:

> There is no need for custom encoders, as they are already built into Java.
>

Will non-ASCII encodings fit into 1 byte? The whole point here is to save
space.


Re: Custom string encoding

Vladimir Ozerov
Yes, this is exactly what single-byte non-UTF-8 encodings such as Cp1251 do.
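
The space saving can be checked with the charsets built into the JDK. The sketch below is illustrative: EncodedSizeDemo and its helper methods are made-up names, and it assumes the JVM provides windows-1251 (alias Cp1251). It also shows the trade-off: a single-byte charset cannot represent characters outside its table, so such strings do not survive a round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedSizeDemo {
    /** Number of bytes the string occupies in the given charset. */
    static int encodedSize(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    /** True if the string is unchanged after encoding and decoding with the charset. */
    static boolean roundTrips(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // Unicode escapes keep the source file ASCII-only.
        String cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters
        String chinese = "\u4e2d\u6587";                          // two CJK characters

        System.out.println(encodedSize(cyrillic, StandardCharsets.UTF_8)); // 12
        System.out.println(encodedSize(cyrillic, cp1251));                 // 6
        System.out.println(roundTrips(cyrillic, cp1251));                  // true
        System.out.println(roundTrips(chinese, cp1251));                   // false
    }
}
```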

Re: Custom string encoding

Valentin Kulichenko
Vova,

That's actually a good point. That would probably be enough, and there is no
need to introduce an abstract encoder. However, I still think it makes sense to
specify a default encoding in BinaryConfiguration and BinaryTypeConfiguration.

-Val
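
One way to picture configuration-level defaults combined with a per-call encoding is a simple fallback chain. The sketch below is hypothetical throughout: the EncodingDefaults class, its methods and the precedence order (explicit argument, then type default, then global default, then UTF-8) are assumptions made for illustration and are not taken from the thread or from Ignite.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical resolver; none of these names exist in Ignite. */
final class EncodingDefaults {
    // Stands in for a default set at the BinaryConfiguration level (may be null).
    private final Charset globalDefault;

    // Stands in for defaults set per type at the BinaryTypeConfiguration level.
    private final Map<String, Charset> typeDefaults = new HashMap<>();

    EncodingDefaults(Charset globalDefault) {
        this.globalDefault = globalDefault;
    }

    void setTypeDefault(String typeName, Charset charset) {
        typeDefaults.put(typeName, charset);
    }

    /** Assumed precedence: explicit argument, then type default, then global default, then UTF-8. */
    Charset resolve(String typeName, Charset explicit) {
        if (explicit != null)
            return explicit;

        Charset byType = typeDefaults.get(typeName);
        if (byType != null)
            return byType;

        return globalDefault != null ? globalDefault : StandardCharsets.UTF_8;
    }
}
```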

On Sun, Jul 2, 2017 at 10:31 AM Vladimir Ozerov <[hidden email]>
wrote:

> Yes, this is exactly what non-UTF8 encodings do.
>
> вс, 2 июля 2017 г. в 20:08, Dmitriy Setrakyan <[hidden email]>:
>
> > On Sun, Jul 2, 2017 at 9:50 AM, Vladimir Ozerov <[hidden email]>
> > wrote:
> >
> > > There is no need for custom encoders, as they are already built-in to
> > Java.
> > >
> >
> > Will non-ASCII encodings fit into 1 byte? The whole point here is to save
> > space.

Re: Custom string encoding

dsetrakyan
Agree with Valya on the system-wide default. We need to have it.

Also, are we certain that such an encoding will give 1 byte per character,
versus the 2 or 3 bytes of UTF-8, for different languages? It would be nice to
test this to confirm, as it has the potential to decrease the Ignite storage
space by 2x in certain cases.

D.
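
The byte counts in question can be checked with plain JDK charsets. The following is a minimal sketch, not Ignite code: the class name EncodingSizeCheck and the sample string are assumptions made for illustration, and windows-1251 (the Cp1251 code page from Vladimir's example) stands in for any single-byte code page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper; the name is an assumption, not an Ignite class.
class EncodingSizeCheck {
    /** Number of bytes needed to store {@code s} in the given charset. */
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Six Cyrillic letters spelling "Privet" ("Hello"), written as
        // Unicode escapes so the source file encoding does not matter.
        String cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442";

        Charset cp1251 = Charset.forName("windows-1251");

        // Each of these Cyrillic letters takes 2 bytes in UTF-8.
        System.out.println("UTF-8:        "
            + encodedLength(cyrillic, StandardCharsets.UTF_8));   // 12

        // The same letters take 1 byte each in the single-byte code page.
        System.out.println("windows-1251: "
            + encodedLength(cyrillic, cp1251));                   // 6
    }
}
```

The 2x saving shown here holds only for text made up of characters that the code page covers and that UTF-8 stores in two bytes. ASCII characters take 1 byte in both encodings, and String.getBytes silently replaces characters the code page cannot represent, so a single-byte code page is not a drop-in replacement for mixed-language data.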

On Sun, Jul 2, 2017 at 12:26 PM, Valentin Kulichenko <
[hidden email]> wrote:

> Vova,
>
> That's actually a good point. That would probably be enough, and there is no
> need to introduce an abstract encoder. However, I still think it makes sense to
> specify a default encoding in BinaryConfiguration and
> BinaryTypeConfiguration.
>
> -Val

Re: Custom string encoding

Valentin Kulichenko
Yes, this needs to be tested and confirmed. I will work on it.

Would be great to get more details about indexes. I'm not sure I understand
the limitation there.

-Val

On Mon, Jul 3, 2017 at 7:21 AM, Dmitriy Setrakyan <[hidden email]>
wrote:

> Agree with Valya on the system-wide default. We need to have it.
>
> Also, are we certain that such an encoding will give 1 byte per character,
> versus the 2 or 3 bytes of UTF-8, for different languages? It would be nice to
> test this to confirm, as it has the potential to decrease the Ignite storage
> space by 2x in certain cases.
>
> D.