Apache Ignite Developers - Legacy Mail Archive

Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Andrey Kuznetsov

Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Hi Igniters,

I'd like to discuss future changes related to IGNITE-5655.

Is it really a good idea to introduce a new flag (ENCODED_STRING) for the existing String datatype? It's possible to use the existing STRING flag at negligible performance cost. Currently, a UTF-8-encoded string looks like

byteFlag nonNegativeIntStrLen bytes

This format can be extended in a backward-compatible way to

byteFlag [negativeIntCharsetCode] nonNegativeIntStrLen bytes

Next, I suggest adding a new BinaryConfiguration property for the encoding instead of using a global property. It seems more convenient for the user.

I'd appreciate your feedback.
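
To make the extended STRING layout above concrete, here is a minimal Java sketch of a writer and reader for it. Everything named here is an assumption for illustration only: the flag value, the charset code table, and the class itself are hypothetical and are not taken from the Ignite sources, and the byte order is simply ByteBuffer's default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExtendedStringFormat {
    /** Placeholder type flag; assumed for this sketch, not read from the marshaller. */
    static final byte STRING_FLAG = 9;

    /** Hypothetical code table: -1 is ISO-8859-1, -2 is UTF-16BE. */
    static Charset charsetFor(int code) {
        switch (code) {
            case -1: return StandardCharsets.ISO_8859_1;
            case -2: return StandardCharsets.UTF_16BE;
            default: throw new IllegalArgumentException("Unknown charset code: " + code);
        }
    }

    /** Writes: flag, [negative charset code], length, bytes. Code 0 means plain UTF-8. */
    static byte[] write(String s, int charsetCode) {
        Charset cs = charsetCode == 0 ? StandardCharsets.UTF_8 : charsetFor(charsetCode);
        byte[] body = s.getBytes(cs);
        ByteBuffer buf = ByteBuffer.allocate(1 + (charsetCode == 0 ? 4 : 8) + body.length);
        buf.put(STRING_FLAG);
        if (charsetCode != 0)
            buf.putInt(charsetCode);  // optional part, always negative
        buf.putInt(body.length);      // always non-negative
        buf.put(body);
        return buf.array();
    }

    /** Reads the layout back; the sign of the first int tells the two forms apart. */
    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        if (buf.get() != STRING_FLAG)
            throw new IllegalArgumentException("Not a string value");
        int first = buf.getInt();
        Charset cs = StandardCharsets.UTF_8;
        int len = first;
        if (first < 0) {              // negative: a charset code, the length follows
            cs = charsetFor(first);
            len = buf.getInt();
        }
        byte[] body = new byte[len];
        buf.get(body);
        return new String(body, cs);
    }
}
```

A reader written this way accepts both the old and the extended form. A reader that predates the change would still take the negative int for a length.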

Andrey Kuznetsov

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

I apologize for damaged formatting. Below is my message as it should be.

Hi Igniters,

I'd like to discuss future changes related to
https://issues.apache.org/jira/browse/IGNITE-5655.

Is it really a good idea to introduce a new flag (ENCODED_STRING) for the
existing String datatype? It's possible to use the existing STRING flag at
negligible performance cost.

Currently, a UTF-8-encoded string looks like

byteFlag nonNegativeIntStrLen bytes

This format can be extended in a backward-compatible way to

byteFlag [negativeIntCharsetCode] nonNegativeIntStrLen bytes

Next, I suggest adding a new BinaryConfiguration property for the encoding
instead of using a global property. It seems more convenient for the user.

I'd appreciate your feedback.

2017-07-25 16:13 GMT+03:00 Andrey Kuznetsov <[hidden email]>:

> Hi Igniters,I'd like to discuss future changes related to IGNITE-5655
> <https://issues.apache.org/jira/browse/IGNITE-5655> . Is it really good
> idea to introduce new flag (ENCODED_STRING) for existing String datatype?
> It's possible to use existing STRING flag at negligible performance cost.
> Currently, utf-8-encoded string looks like
> byteFlag nonNegativeIntStrLen bytes
> This format can be backward compatibly extended to
> byteFlag [negativeIntCharsetCode] nonNegativeIntStrLen bytes
> Next, I suggest to add new BinaryConfiguration property for encoding to use
> instead of using global property. It seems to be more convenient for
> user.I'll appreciate your feedback.
>
>
>
> -----
> Best regards,
> Andrey Kuznetsov.
> --
> View this message in context: http://apache-ignite-
> developers.2346864.n4.nabble.com/Non-UTF-8-string-encoding-
> support-in-BinaryMarshaller-IGNITE-5655-tp20024.html
> Sent from the Apache Ignite Developers mailing list archive at Nabble.com.

--
Best regards,
Andrey Kuznetsov.

Vladimir Ozerov

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Andrey,

You cannot have an optional part in the middle, as it will break compatibility
in a dangerous way, probably leading to a node crash. Also, an INT (4 bytes)
looks like too much to me.

Instead, I would add a new type, "encoded string":
1 byte - type
1 byte - encoding code; map frequently used encodings to byte values, and
reserve a special value meaning that the encoding name is written as a string
afterwards, so that we support any encoding out of the box
[optional] encoding name
4 bytes - string length
Finally - string bytes

Vladimir.
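
The "encoded string" layout described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the type id 100, the code values, and the class name are hypothetical, and the thread does not say how the optional encoding name is framed, so a 4-byte length prefix is assumed here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedStringType {
    static final byte ENCODED_STRING = 100; // hypothetical type id
    static final byte ENC_CUSTOM = 0;       // special value: charset name follows
    static final byte ENC_ISO_8859_1 = 1;   // hypothetical codes for frequent charsets
    static final byte ENC_UTF_16BE = 2;

    static byte codeFor(Charset cs) {
        if (cs.equals(StandardCharsets.ISO_8859_1)) return ENC_ISO_8859_1;
        if (cs.equals(StandardCharsets.UTF_16BE)) return ENC_UTF_16BE;
        return ENC_CUSTOM;
    }

    /** Writes: type, encoding code, [name length + name], string length, string bytes. */
    static byte[] write(String s, Charset cs) {
        byte code = codeFor(cs);
        byte[] name = code == ENC_CUSTOM
            ? cs.name().getBytes(StandardCharsets.US_ASCII) : new byte[0];
        byte[] body = s.getBytes(cs);
        int nameSize = code == ENC_CUSTOM ? 4 + name.length : 0;
        ByteBuffer buf = ByteBuffer.allocate(2 + nameSize + 4 + body.length);
        buf.put(ENCODED_STRING).put(code);
        if (code == ENC_CUSTOM)
            buf.putInt(name.length).put(name); // assumed framing of the name
        buf.putInt(body.length).put(body);
        return buf.array();
    }

    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        if (buf.get() != ENCODED_STRING)
            throw new IllegalArgumentException("Not an encoded string");
        Charset cs;
        switch (buf.get()) {
            case ENC_ISO_8859_1: cs = StandardCharsets.ISO_8859_1; break;
            case ENC_UTF_16BE: cs = StandardCharsets.UTF_16BE; break;
            case ENC_CUSTOM: {
                byte[] name = new byte[buf.getInt()];
                buf.get(name);
                cs = Charset.forName(new String(name, StandardCharsets.US_ASCII));
                break;
            }
            default: throw new IllegalArgumentException("Unknown encoding code");
        }
        byte[] body = new byte[buf.getInt()];
        buf.get(body);
        return new String(body, cs);
    }
}
```

With this framing a frequently used charset costs one extra byte per value, while a rare one pays for its name each time it is written.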

Tue, Jul 25, 2017 at 18:24, Andrey Kuznetsov <[hidden email]>:

> I apologize for damaged formatting. Below is my message as it should be.
>
>
> Hi Igniters,
>
> I'd like to discuss future changes related to
> https://issues.apache.org/jira/browse/IGNITE-5655.
>
> Is it really good idea to introduce new flag (ENCODED_STRING) for existing
> String datatype? It's possible to use existing STRING flag at negligible
> performance cost.
>
> Currently, utf-8-encoded string looks like
>
> byteFlag nonNegativeIntStrLen bytes
>
> This format can be backward compatibly extended to
>
> byteFlag [negativeIntCharsetCode] nonNegativeIntStrLen bytes
>
> Next, I suggest to add new BinaryConfiguration property for encoding to use
> instead of using global property. It seems to be more convenient for user.
>
> I'll appreciate your feedback.

daradurvs

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Hi Andrey.

Sounds very useful.

We can save one byte if we use controlled overflow on
[nonNegativeIntStrLen]:
if [nonNegativeIntStrLen < 0] then [string is encoded]
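
One possible reading of this suggestion, sketched in Java: the sign of the stored length doubles as the "encoded" flag, so no separate marker byte is needed. The helper names are hypothetical, and the sketch assumes the charset itself is known from elsewhere (for example from configuration), which the message does not specify.

```java
final class SignFlaggedLength {
    /** Stores the length as-is for UTF-8, or as its bitwise complement when encoded. */
    static int pack(int len, boolean encoded) {
        if (len < 0)
            throw new IllegalArgumentException("Length must be non-negative");
        return encoded ? ~len : len; // ~len is always negative, even for len == 0
    }

    /** A negative stored value means the string uses a non-default encoding. */
    static boolean isEncoded(int stored) {
        return stored < 0;
    }

    /** Recovers the real byte length from the stored value. */
    static int length(int stored) {
        return stored < 0 ? ~stored : stored;
    }
}
```

The complement is used instead of plain negation so that an empty encoded string (length 0) is still distinguishable from an empty UTF-8 one.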

I have some questions:
Will there be any public API, e.g. an "Encoder" interface?
Will the user be able to define their own encoding format?

2017-07-25 20:13 GMT+03:00 Vladimir Ozerov <[hidden email]>:

> Andrey,
>
> You cannot have optional part in the middle as it will break compatibility
> in dangerous way, probably leading to node crash. Also having INT (4 bytes)
> looks too much for me.
>
> Instead, I would add new type "encoded string":
> 1 byte - type
> 1 byte - encoding code, map frequently used encodings to some byte value;
> also have a special value, meaning that encoding will be written as string
> afterwards, this way we will support any encoding out of the box
> [optional] encoding name
> 4 bytes - string length
> Finally - string bytes
>
> Vladimir.

--
Best Regards, Vyacheslav D.

dsetrakyan

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

In reply to this post by Vladimir Ozerov

I don't understand why this encoding is done on per-object and not on
per-cache level. Shouldn't the column-to-encoding mapping be defined at
cache level configuration?

On Tue, Jul 25, 2017 at 12:13 PM, Vladimir Ozerov <[hidden email]>
wrote:

Vladimir Ozerov

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Vyacheslav,
When we finish the varlen optimization for string lengths, I am afraid we could
end up with a very messy protocol if we mix the encoded length and the encoding.

Dima,
Encoding must be set on a per-field basis. This will give us the most flexible
solution at the cost of 1-byte overhead.

Tue, Jul 25, 2017 at 20:23, Dmitriy Setrakyan <[hidden email]>:

> I don't understand why this encoding is done on per-object and not on
> per-cache level. Shouldn't the column-to-encoding mapping be defined at
> cache level configuration?

Andrey Kuznetsov

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

In reply to this post by Vladimir Ozerov

Vladimir,

Thanks for the reply. In any case, we'll break compatibility by introducing a new
feature in marshalling. But both approaches preserve backward
compatibility.

I thought it unusual to have two different type markers (flags) for a
single datatype. I can't see the source right now, but I'm unsure whether
it's possible to map two flags to a single type in the marshaller implementation.

Jul 25, 2017 at 20:13, "Vladimir Ozerov" <[hidden email]> wrote:

> Andrey,
>
> You cannot have optional part in the middle as it will break compatibility
> in dangerous way, probably leading to node crash. Also having INT (4
> bytes)
> looks too much for me.
>
> Instead, I would add new type "encoded string":
> 1 byte - type
> 1 byte - encoding code, map frequently used encodings to some byte value;
> also have a special value, meaning that encoding will be written as string
> afterwards, this way we will support any encoding out of the box
> [optional] encoding name
> 4 bytes - string length
> Finally - string bytes
>
> Vladimir.

dsetrakyan

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

In reply to this post by Vladimir Ozerov

On Tue, Jul 25, 2017 at 12:36 PM, Vladimir Ozerov <[hidden email]>
wrote:

> Vyacheslav,
> When we finish varlen optimization for string lengths, I am afraid we could
> end up with very messy protocol, should we mix encoded length and encoding.
>
> Dima,
> Encoding must be set on per field basis. This will give us as most flexible
> solution at the cost of 1-byte overhead.
>

Vova, I agree that the encoding should be set on per-field basis, but at
the table level, not at a cell level. I cannot foresee a situation where we
would have different encodings in the same column. If that ever happens,
then user can provide already encoded values.
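
As a rough illustration of the table-level alternative, the Java sketch below keeps a field-to-charset mapping outside the serialized data, so the bytes themselves carry no encoding marker. The class and its methods are hypothetical and do not correspond to any existing Ignite configuration API; UTF-8 is assumed as the default for unmapped fields.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class ColumnEncodings {
    /** Field name to charset, as it might be declared once per table or cache. */
    private final Map<String, Charset> byField = new HashMap<>();

    ColumnEncodings with(String field, Charset cs) {
        byField.put(field, cs);
        return this;
    }

    /** Fields without an explicit mapping fall back to UTF-8. */
    Charset charsetOf(String field) {
        return byField.getOrDefault(field, StandardCharsets.UTF_8);
    }

    /** No charset code is written: the reader must consult the same mapping. */
    byte[] encode(String field, String value) {
        return value.getBytes(charsetOf(field));
    }

    String decode(String field, byte[] bytes) {
        return new String(bytes, charsetOf(field));
    }
}
```

The trade-off is visible in decode(): the payload is smaller, but every reader needs access to the same mapping to interpret the bytes.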

daradurvs

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Vladimir,
> When we finish varlen optimization for string lengths, I am afraid we could
> end up with very messy protocol, should we mix encoded length and encoding.
I agree, we shouldn't mix them.

> I deemed it's unusual to make two different type markers (flags) for
> single datatype. I can't see the source right now
Theoretically, you can combine GridBinaryMarshaller.STRING with
BinaryWriteMode.
I agree with Vladimir: adding a new type is the clearest way
for me.

> Encoding must be set on per field basis. This will give us as most flexible
> solution at the cost of 1-byte overhead.

> Vova, I agree that the encoding should be set on per-field basis, but at
> the table level, not at a cell level.

Dmitriy, Vladimir,
Let's use both approaches :-)
We can add a parameter to CacheConfiguration.
If the parameter specifies cache-level encoding, then the marshaller will use
the cache's encoding;
otherwise the marshaller will use per-field encoding.
Of course, only if it doesn't complicate the solution.

2017-07-25 20:44 GMT+03:00 Dmitriy Setrakyan <[hidden email]>:

> On Tue, Jul 25, 2017 at 12:36 PM, Vladimir Ozerov <[hidden email]>
> wrote:
>
> > Vyacheslav,
> > When we finish varlen optimization for string lengths, I am afraid we
> could
> > end up with very messy protocol, should we mix encoded length and
> encoding.
> >
> > Dima,
> > Encoding must be set on per field basis. This will give us as most
> flexible
> > solution at the cost of 1-byte overhead.
> >
>
> Vova, I agree that the encoding should be set on per-field basis, but at
> the table level, not at a cell level. I cannot foresee a situation where we
> would have different encodings in the same column. If that ever happens,
> then user can provide already encoded values.

--
Best Regards, Vyacheslav D.

Andrey Kuznetsov

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

In reply to this post by Vladimir Ozerov

Vladimir,

It's rather simple to support string encoding by setting it in BinaryConfiguration. But I'm unsure whether that's the desired change. We need to express our goal more precisely: should we control encoding at the cache level, the field level, or the binary configuration level? Currently, BinaryMarshaller is controlled only by BinaryConfiguration, and it's hard for me to estimate the changes needed to bring string encoding to, say, a per-cache basis.

2017-07-25 20:17 GMT+03:00 Vladimir Ozerov [via Apache Ignite Developers] <[hidden email]>:

> Vyacheslav,
> When we finish varlen optimization for string lengths, I am afraid we could
> end up with very messy protocol, should we mix encoded length and encoding.
>
> Dima,
> Encoding must be set on per field basis. This will give us as most flexible
> solution at the cost of 1-byte overhead.

Best regards,

Andrey Kuznetsov.

dsetrakyan

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

In reply to this post by daradurvs

On Wed, Jul 26, 2017 at 3:40 AM, Vyacheslav Daradur <[hidden email]>
wrote:

>
> > Encoding must be set on a per-field basis. This will give us the most
> > flexible solution at the cost of 1-byte overhead.
>
> > Vova, I agree that the encoding should be set on a per-field basis, but at
> > the table level, not at the cell level.
>
> Dmitriy, Vladimir,
> Let's use both approaches :-)
> We can add a parameter to CacheConfiguration.
> If the parameter specifies cache-level encoding, then the marshaller will
> use the encoding configured for the cache;
> otherwise the marshaller will use per-field encoding.
> Of course, only if it doesn't complicate the solution.
>
>

I think that it will complicate the solution and will complicate the
marshalling protocol. The advantage of specifying the encoding at
table/cache level is that we don't need to add extra encoding bytes to the
marshalling protocol.

I think Vova was suggesting encoding at the cell level, not at the field
level, which seems to be redundant to me.

Vova, do you agree?

Igor Sapego-2

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Just a note from the platforms guy:

A solution with table-level configuration is going to be significantly
harder to implement for platforms and ODBC than a field-level one.

Also, what about binary objects that are not stored in a cache
but are still marshalled?

Best Regards,
Igor


dsetrakyan

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

On Thu, Jul 27, 2017 at 9:04 AM, Igor Sapego <[hidden email]> wrote:

> Just a note from the platforms guy:
>
> A solution with table-level configuration is going to be significantly
> harder to implement for platforms and ODBC than a field-level one.
>

Igor, it seems like you are advocating the per-cell configuration, not
per-field one. The per-field configuration can be defined at the
table/cache level.

I see your point about C++ and .NET integrations however. Can't we provide
this info at node-join time or table-creation time? This way all nodes will
receive it and you will be able to grab it on different platforms.

>
> Also, what about binary objects that are not stored in a cache
> but are still marshalled?
>

I think the default system encoding should be used here. If we don't have
configuration for default encoding, we should add it.


Pavel Tupitsyn

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

I'm not sure I understand how this "per field" configuration is supposed to
be implemented.
* The marshaller is not tied to a cache. It serializes all kinds of things,
like compute job parameters and results.
* Raw mode does not involve field names.

It also seems like a complicated and expensive solution: looking up the
string format somewhere in the metadata will be slow.

The "encoded string" data type suggested by Vladimir looks better to me from
a performance and implementation standpoint.

Thanks,
Pavel


Igor Sapego

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

In reply to this post by dsetrakyan

> Igor, it seems like you are advocating the per-cell configuration, not
> per-field one.

True, there is some mismatch in terminology here.

> I see your point about C++ and .NET integrations however. Can't we provide
> this info at node-join time or table-creation time? This way all nodes will
> receive it and you will be able to grab it on different platforms.

This issue can be solved in different ways; I am just saying that it will be
significantly more complicated. It is something we may want to consider when
we choose a solution here.

Best Regards,
Igor


dsetrakyan

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

In reply to this post by Pavel Tupitsyn

Pavel, what would be the size overhead? Are we adding 1 byte for every
field just for this? If you would like to have this info in the binary
object directly, can we in this case have some bitmap of field-to-encoding?

D.
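
For comparison, here is a rough sketch of what a field-to-encoding bitmap could cost. It is only an illustration under stated assumptions: the class and method names are hypothetical, nothing like this exists in the Ignite binary format, and it assumes a single alternative encoding so that one flag bit per field is enough.

```java
import java.util.BitSet;

// Hypothetical sketch: one flag bit per field marks "uses the non-default encoding".
final class FieldEncodingBitmapSketch {
    /** Builds a bitmap with a bit set for every field index that uses the alternative encoding. */
    static BitSet bitmapFor(int... encodedFieldIdxs) {
        BitSet bits = new BitSet();

        for (int idx : encodedFieldIdxs)
            bits.set(idx);

        return bits;
    }

    /** Bytes needed per object to hold one flag bit for each of fieldCnt fields. */
    static int bitmapBytes(int fieldCnt) {
        return (fieldCnt + 7) / 8;
    }

    public static void main(String[] args) {
        BitSet bits = bitmapFor(2, 9); // fields 2 and 9 use the alternative encoding

        System.out.println("field 2 encoded: " + bits.get(2));
        System.out.println("field 3 encoded: " + bits.get(3));
        System.out.println("bitmap bytes for 10 fields: " + bitmapBytes(10));
    }
}
```

The bitmap is paid once per object regardless of how many fields actually use the alternative encoding, and a reader still has to map each string back to its field index, which raw mode does not provide.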


Pavel Tupitsyn

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

> 1 byte for every field just for this
The GridBinaryMarshaller.STRING data type remains untouched.
We add GridBinaryMarshaller.STRING_ENCODED, which has an additional byte for
the encoding type.

This means no overhead for existing code.
I think the most common use case is English, which uses 1 byte per char in
UTF-8.
This is already as fast and compact as possible, and we don't want to
introduce any lookup overhead here.

And when users know that their data will be more compact in some specific
encoding, they use a BinaryWriter.writeString overload, which writes a
different type code.

Yes, it also writes an extra byte, but you save a byte per char of the
actual string (for example, when using Windows-1251 for Russian text), so
this does not matter.
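
To make the size trade-off concrete, here is a minimal sketch of the two layouts. It is not the actual marshaller code: the numeric type codes, the one-byte charset id, and the class and method names are all assumptions made for illustration, and byte order is ignored.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the STRING and STRING_ENCODED layouts discussed above.
final class EncodedStringSketch {
    // Illustrative type codes; the real GridBinaryMarshaller values may differ.
    static final byte STRING = 9;
    static final byte STRING_ENCODED = 40;

    // Hypothetical one-byte charset id used only in this sketch.
    static final byte CS_WINDOWS_1251 = 1;

    /** Existing layout: type code, int length, UTF-8 bytes. */
    static byte[] writeUtf8(String s) {
        byte[] data = s.getBytes(StandardCharsets.UTF_8);

        return ByteBuffer.allocate(1 + 4 + data.length)
            .put(STRING).putInt(data.length).put(data).array();
    }

    /** Proposed layout: type code, charset id, int length, encoded bytes. */
    static byte[] writeEncoded(String s) {
        byte[] data = s.getBytes(Charset.forName("windows-1251"));

        return ByteBuffer.allocate(1 + 1 + 4 + data.length)
            .put(STRING_ENCODED).put(CS_WINDOWS_1251).putInt(data.length).put(data).array();
    }

    /** The reader dispatches on the type code alone; no metadata lookup is needed. */
    static String read(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte type = buf.get();
        Charset cs;

        if (type == STRING)
            cs = StandardCharsets.UTF_8;
        else if (type == STRING_ENCODED) {
            byte id = buf.get();

            if (id != CS_WINDOWS_1251)
                throw new IllegalArgumentException("Unknown charset id: " + id);

            cs = Charset.forName("windows-1251");
        }
        else
            throw new IllegalArgumentException("Not a string type code: " + type);

        byte[] data = new byte[buf.getInt()];
        buf.get(data);

        return new String(data, cs);
    }

    public static void main(String[] args) {
        String ru = "\u041f\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters

        System.out.println("UTF-8 layout:   " + writeUtf8(ru).length + " bytes");
        System.out.println("encoded layout: " + writeEncoded(ru).length + " bytes");
    }
}
```

For this six-letter Cyrillic string the UTF-8 layout takes 1 + 4 + 12 = 17 bytes and the encoded layout takes 1 + 1 + 4 + 6 = 12 bytes, so the extra charset byte is recovered after the first saved byte.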


Valentin Kulichenko

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Pavel,

This forces the user to implement Binarylizable for the whole type if they
want to change the encoding for one or two fields, right? I really don't like
that. Why not add a default encoding to BinaryTypeConfiguration?

-Val


Pavel Tupitsyn

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

Val, of course other options should be available, such as
BinaryTypeConfiguration,
and maybe field-level and class-level annotations.


Vladimir Ozerov

Re: Non-UTF-8 string encoding support in BinaryMarshaller (IGNITE-5655)

As Pavel mentioned, the marshaller should not be tied to a cache. A
BinaryObject should be self-describing, i.e. contain all the information
necessary for unmarshalling. This is an absolute requirement.

We will have one extra byte in the serialized form, meaning that the
advantage of a custom encoding will become evident for all strings with
length >= 1, which is perfectly fine. I do not quite understand what we are
arguing about.

As for configuration, we can do it as follows:

1) Add global encoding, UTF8 by default.
2) Add per-cache encoding.
3) Add encoding to JDBC and ODBC driver properties.

This should be enough.
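
One possible reading of how the global and per-cache settings listed above could combine is sketched here. The precedence (per-cache overrides global, global falls back to UTF-8) is an assumption that the thread does not spell out, and the class and method names are hypothetical rather than existing Ignite configuration APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: picks the write-side encoding from layered settings.
final class EncodingResolutionSketch {
    /** Fallback used when nothing is configured, matching "UTF8 by default". */
    static final Charset BUILT_IN_DEFAULT = StandardCharsets.UTF_8;

    /**
     * Assumed precedence: a per-cache setting wins over the global one,
     * and a missing global setting falls back to UTF-8.
     */
    static Charset resolve(Charset global, Charset perCache) {
        if (perCache != null)
            return perCache;

        if (global != null)
            return global;

        return BUILT_IN_DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null));
        System.out.println(resolve(StandardCharsets.ISO_8859_1, null));
        System.out.println(resolve(StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII));
    }
}
```

Such a setting would only choose the encoding used when writing. Since each encoded string carries its own encoding byte, reading would not depend on any of these settings.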
