Hi Igniters,
I'd like to discuss future changes related to IGNITE-5655. Is it really a good idea to introduce a new flag (ENCODED_STRING) for the existing String datatype? It's possible to use the existing STRING flag at negligible performance cost. Currently, a UTF-8-encoded string looks like: byteFlag nonNegativeIntStrLen bytes. This format can be extended in a backward-compatible way to: byteFlag [negativeIntCharsetCode] nonNegativeIntStrLen bytes. Next, I suggest adding a new BinaryConfiguration property for the encoding to use, instead of a global property. It seems more convenient for the user. I'd appreciate your feedback.
I apologize for damaged formatting. Below is my message as it should be.
Hi Igniters,

I'd like to discuss future changes related to https://issues.apache.org/jira/browse/IGNITE-5655.

Is it really a good idea to introduce a new flag (ENCODED_STRING) for the existing String datatype? It's possible to use the existing STRING flag at negligible performance cost.

Currently, a UTF-8-encoded string looks like

byteFlag nonNegativeIntStrLen bytes

This format can be extended in a backward-compatible way to

byteFlag [negativeIntCharsetCode] nonNegativeIntStrLen bytes

Next, I suggest adding a new BinaryConfiguration property for the encoding to use, instead of a global property. It seems more convenient for the user.

I'd appreciate your feedback.

--
Best regards,
Andrey Kuznetsov.
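The proposed layout, with an optional negative charset code in front of the length, can be sketched as follows. This is only an illustration and not Ignite code: the class name, the flag value 9, the little-endian byte order and the mapping of -1 and -2 to concrete charsets are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of "byteFlag [negativeIntCharsetCode] nonNegativeIntStrLen bytes". */
final class OptionalCharsetCodeSketch {
    /** Placeholder type flag; the value is an assumption for this example. */
    static final byte STRING_FLAG = 9;

    /** Hypothetical table of negative charset codes. */
    static Charset charsetForCode(int code) {
        switch (code) {
            case -1: return StandardCharsets.ISO_8859_1;
            case -2: return StandardCharsets.UTF_16LE;
            default: throw new IllegalArgumentException("Unknown charset code: " + code);
        }
    }

    /** Writes a string; charsetCode == 0 means "no code, plain UTF-8" (the current layout). */
    static byte[] write(String s, int charsetCode) {
        if (charsetCode > 0)
            throw new IllegalArgumentException("Charset code must be negative or 0");

        boolean hasCode = charsetCode < 0;
        Charset cs = hasCode ? charsetForCode(charsetCode) : StandardCharsets.UTF_8;
        byte[] payload = s.getBytes(cs);

        ByteBuffer buf = ByteBuffer.allocate(1 + (hasCode ? 4 : 0) + 4 + payload.length)
            .order(ByteOrder.LITTLE_ENDIAN);

        buf.put(STRING_FLAG);
        if (hasCode)
            buf.putInt(charsetCode);   // optional part: a negative int
        buf.putInt(payload.length);    // non-negative length in bytes
        buf.put(payload);

        return buf.array();
    }

    /** Reads a string: a negative first int is a charset code, otherwise it is the length. */
    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);

        if (buf.get() != STRING_FLAG)
            throw new IllegalArgumentException("Not a string");

        int first = buf.getInt();
        Charset cs = StandardCharsets.UTF_8;
        int len = first;

        if (first < 0) {
            cs = charsetForCode(first);
            len = buf.getInt();
        }

        byte[] payload = new byte[len];
        buf.get(payload);

        return new String(payload, cs);
    }
}
```

A reader written this way still accepts the current UTF-8 layout. A reader that predates the change would interpret the negative code as a length, which is the compatibility concern to keep in mind with this approach.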
Andrey,
You cannot have an optional part in the middle, as it will break compatibility in a dangerous way, probably leading to a node crash. Also, an INT (4 bytes) looks like too much to me.

Instead, I would add a new type, "encoded string":
1 byte - type
1 byte - encoding code; map frequently used encodings to byte values, and reserve a special value meaning that the encoding name is written as a string afterwards. This way we will support any encoding out of the box.
[optional] encoding name
4 bytes - string length
Finally - string bytes

Vladimir.
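The "encoded string" layout listed above can be sketched as below. This is a rough illustration under stated assumptions, not the Ignite implementation: the class name, the flag value 100, the code table (0 for UTF-8, 1 for ISO-8859-1, -1 for "name follows"), the little-endian byte order and the way the encoding name is written (4-byte length plus ASCII bytes) are all made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of: type byte, encoding code byte, [encoding name], int length, bytes. */
final class EncodedStringSketch {
    static final byte ENCODED_STRING_FLAG = 100; // placeholder type marker (assumption)
    static final byte CODE_UTF_8 = 0;            // hypothetical code table
    static final byte CODE_ISO_8859_1 = 1;
    static final byte CODE_BY_NAME = -1;         // special value: charset name follows

    static byte codeFor(Charset cs) {
        if (cs.equals(StandardCharsets.UTF_8)) return CODE_UTF_8;
        if (cs.equals(StandardCharsets.ISO_8859_1)) return CODE_ISO_8859_1;
        return CODE_BY_NAME;
    }

    static Charset charsetFor(byte code) {
        switch (code) {
            case CODE_UTF_8: return StandardCharsets.UTF_8;
            case CODE_ISO_8859_1: return StandardCharsets.ISO_8859_1;
            default: throw new IllegalArgumentException("Unknown encoding code: " + code);
        }
    }

    static byte[] write(String s, Charset cs) {
        byte code = codeFor(cs);
        byte[] name = code == CODE_BY_NAME
            ? cs.name().getBytes(StandardCharsets.US_ASCII) : new byte[0];
        byte[] payload = s.getBytes(cs);

        int size = 2 + (code == CODE_BY_NAME ? 4 + name.length : 0) + 4 + payload.length;
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);

        buf.put(ENCODED_STRING_FLAG).put(code);
        if (code == CODE_BY_NAME)
            buf.putInt(name.length).put(name);   // optional encoding name
        buf.putInt(payload.length).put(payload); // string length, then string bytes

        return buf.array();
    }

    static String read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);

        if (buf.get() != ENCODED_STRING_FLAG)
            throw new IllegalArgumentException("Not an encoded string");

        byte code = buf.get();
        Charset cs;

        if (code == CODE_BY_NAME) {
            byte[] name = new byte[buf.getInt()];
            buf.get(name);
            cs = Charset.forName(new String(name, StandardCharsets.US_ASCII));
        }
        else
            cs = charsetFor(code);

        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);

        return new String(payload, cs);
    }
}
```

With this layout a frequently used encoding costs one extra byte per string, while a rare encoding pays for its name on every value.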
Hi Andrey.
Sounds very useful.

We can save one byte if we use a controlled overflow on [nonNegativeIntStrLen]:
If [nonNegativeIntStrLen < 0] then [string is encoded]

I have some questions:
Will there be any public API, e.g. an "Encoder" interface?
Will users have the opportunity to define their own encoding format?

--
Best Regards, Vyacheslav D.
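One possible reading of the "negative length means encoded" idea is to store the bitwise complement of the length when the string is not in UTF-8, so that the sign carries the marker and the length is still recoverable. The helper below is a hypothetical sketch of that reading only; the class and method names are invented, and where the actual charset would come from is left open.

```java
/** Hypothetical sketch: the sign of the stored length marks a non-UTF-8 string. */
final class SignedLengthSketch {
    /** Stores len as-is for plain strings and as ~len (always negative) for encoded ones. */
    static int encodeLength(int len, boolean encoded) {
        if (len < 0)
            throw new IllegalArgumentException("Length must be non-negative");

        return encoded ? ~len : len;
    }

    /** A negative stored value means the string is encoded. */
    static boolean isEncoded(int stored) {
        return stored < 0;
    }

    /** Recovers the real byte length from the stored value. */
    static int decodeLength(int stored) {
        return stored < 0 ? ~stored : stored;
    }
}
```

Using the complement rather than plain negation keeps a zero-length encoded string distinguishable, since ~0 is -1.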
In reply to this post by Vladimir Ozerov
I don't understand why this encoding is done at the per-object level and not at the per-cache level. Shouldn't the column-to-encoding mapping be defined in the cache-level configuration?
Vyacheslav,
When we finish the varlen optimization for string lengths, I am afraid we could end up with a very messy protocol if we mix the encoded length and the encoding.

Dima,
Encoding must be set on a per-field basis. This will give us the most flexible solution at the cost of a 1-byte overhead.
In reply to this post by Vladimir Ozerov
Vladimir,
Thanks for the reply. In any case, introducing a new feature in marshalling will break compatibility to some degree, but both approaches preserve backward compatibility. It seemed unusual to me to have two different type markers (flags) for a single datatype. I can't look at the source right now, and I'm unsure whether it's possible to map two flags to a single type in the marshaller implementation.
In reply to this post by Vladimir Ozerov
On Tue, Jul 25, 2017 at 12:36 PM, Vladimir Ozerov <[hidden email]> wrote:

> Vyacheslav,
> When we finish varlen optimization for string lengths, I am afraid we could
> end up with very messy protocol, should we mix encoded length and encoding.
>
> Dima,
> Encoding must be set on per field basis. This will give us as most flexible
> solution at the cost of 1-byte overhead.

Vova, I agree that the encoding should be set on a per-field basis, but at the table level, not at the cell level. I cannot foresee a situation where we would have different encodings in the same column. If that ever happens, the user can provide already encoded values.
Vladimir,
> When we finish varlen optimization for string lengths, I am afraid we could
> end up with very messy protocol, should we mix encoded length and encoding.

I agree, we shouldn't mix them.

> I deemed it's unusual to make two different type markers (flags) for
> single datatype. I can't see the source right now

Theoretically, you can combine GridBinaryMarshaller.STRING with BinaryWriteMode. I agree with Vladimir: adding a new type is the clearest approach to me.

> Encoding must be set on per field basis. This will give us as most flexible
> solution at the cost of 1-byte overhead.

> Vova, I agree that the encoding should be set on per-field basis, but at
> the table level, not at a cell level.

Dmitriy, Vladimir,
Let's use both approaches :-)
We can add a parameter to CacheConfiguration. If the parameter specifies cache-level encoding, the marshaller will use the encoding configured for the cache; otherwise the marshaller will use per-field encoding.
Of course, only if it doesn't complicate the solution.

--
Best Regards, Vyacheslav D.
In reply to this post by Vladimir Ozerov
Vladimir,

It's rather simple to support string encoding by setting it in BinaryConfiguration, but I'm unsure whether that is the desired change. We need to state our goal more precisely: should we control encoding at the cache level, the field level, or the binary configuration level? Currently, BinaryMarshaller is controlled only by BinaryConfiguration, and it's hard for me to estimate the changes needed to bring string encoding to, say, a per-cache basis.

--
Best regards,
Andrey Kuznetsov.
In reply to this post by daradurvs
On Wed, Jul 26, 2017 at 3:40 AM, Vyacheslav Daradur <[hidden email]> wrote:

> Dmitriy, Vladimir,
> Let's use both approaches :-)
> We can add parameter to CacheConfiguration.
> If parameter specifie to use cache level encoding then marshaller will use
> encoding in a cache,
> otherwise marshaller will use per-field encoding.
> Of course only if it doesn't complicate the solution.

I think that it will complicate the solution and will complicate the marshalling protocol. The advantage of specifying the encoding at the table/cache level is that we don't need to add extra encoding bytes to the marshalling protocol.

I think Vova was suggesting encoding at the cell level, not at the field level, which seems redundant to me.

Vova, do you agree?
Just a note from the platforms guy:
A solution with table-level configuration is going to be significantly harder to implement for platforms and ODBC than a field-level one.

Also, what about binary objects that are not stored in a cache but are still marshalled?

Best Regards,
Igor
On Thu, Jul 27, 2017 at 9:04 AM, Igor Sapego <[hidden email]> wrote:
> Just a note from the platforms guy:
>
> Solution with table-level configuration is going to be significantly
> harder to implement for platforms and ODBC then field-level one.

Igor, it seems like you are advocating the per-cell configuration, not the per-field one. The per-field configuration can be defined at the table/cache level.

I see your point about the C++ and .NET integrations, however. Can't we provide this info at node-join time or table-creation time? This way all nodes will receive it, and you will be able to grab it on different platforms.

> Also, what about binary objects, which are not stored in cache,
> but being marshalled?

I think the default system encoding should be used here. If we don't have a configuration property for the default encoding, we should add one.
I'm not sure I understand how this "per field" configuration is supposed to be implemented.
* Marshaller is not tied to a cache. It serializes all kinds of things, like compute job parameters and results.
* Raw mode does not involve field names.

Also it seems like a complicated and expensive solution - looking up string format somewhere in the metadata will be slow.

The "encoded string" data type suggestion from Vladimir looks better to me from a performance and implementation standpoint.

Thanks,
Pavel

On Thu, Jul 27, 2017 at 5:10 PM, Dmitriy Setrakyan <[hidden email]> wrote:

> On Thu, Jul 27, 2017 at 9:04 AM, Igor Sapego <[hidden email]> wrote:
>
> > Just a note from the platforms guy:
> >
> > Solution with table-level configuration is going to be significantly
> > harder to implement for platforms and ODBC than field-level one.
>
> Igor, it seems like you are advocating the per-cell configuration, not
> per-field one. The per-field configuration can be defined at the
> table/cache level.
>
> I see your point about C++ and .NET integrations however. Can't we provide
> this info at node-join time or table-creation time? This way all nodes will
> receive it and you will be able to grab it on different platforms.
>
> > Also, what about binary objects, which are not stored in cache,
> > but being marshalled?
>
> I think the default system encoding should be used here. If we don't have
> configuration for default encoding, we should add it.
In reply to this post by dsetrakyan
> Igor, it seems like you are advocating the per-cell configuration, not
> per-field one.

True, there is some terminology mismatch here.

> I see your point about C++ and .NET integrations however. Can't we provide
> this info at node-join time or table-creation time? This way all nodes will
> receive it and you will be able to grab it on different platforms.

This issue can be solved in different ways, I'm just saying that it will be significantly more complicated. Just something we may want to consider when we choose a solution here.

Best Regards,
Igor
In reply to this post by Pavel Tupitsyn
Pavel, what would be the size overhead? Are we adding 1 byte for every field just for this? If you would like to have this info in the binary object directly, can we in this case have some bitmap of field-to-encoding?

D.

On Thu, Jul 27, 2017 at 9:22 AM, Pavel Tupitsyn <[hidden email]> wrote:

> I'm not sure I understand how this "per field" configuration is supposed to
> be implemented.
> * Marshaller is not tied to a cache. It serializes all kinds of things,
> like compute job parameters and results.
> * Raw mode does not involve field names.
>
> Also it seems like a complicated and expensive solution - looking up string
> format somewhere in the metadata will be slow.
>
> "encoded string" data type suggestion from Vladimir looks better to me from
> performance and implementation standpoint.
>
> Thanks,
> Pavel
> 1 byte for every field just for this

GridBinaryMarshaller.STRING data type remains untouched.
We add GridBinaryMarshaller.STRING_ENCODED, which has an additional byte for the encoding type.

This means no overhead for existing code.
I think the most common use case is English, which uses 1 byte per char in UTF-8.
This is already as fast and compact as possible, and we don't want to introduce any lookup overhead here.

And when users know that their data will be more compact in some specific encoding, they use some BinaryWriter.writeString overload, which writes a different type code.

Yes, it also writes an extra byte, but you save a byte per char of the actual string (for example, when using Windows-1251 for Russian text), so this does not matter.

On Thu, Jul 27, 2017 at 5:35 PM, Dmitriy Setrakyan <[hidden email]> wrote:

> Pavel, what would be the size overhead? Are we adding 1 byte for every
> field just for this? If you would like to have this info in the binary
> object directly, can we in this case have some bitmap of field-to-encoding?
>
> D.
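The size trade-off between the STRING and STRING_ENCODED layouts discussed here can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the type code values, the charset ids, the little-endian 4-byte length, and the function names `write_string` and `write_string_encoded` are hypothetical and are not taken from GridBinaryMarshaller.

```python
import struct

# Hypothetical type codes and charset ids, chosen only for this sketch;
# they are not the values used by the real marshaller.
STRING = 9
STRING_ENCODED = 100
CHARSET_IDS = {"cp1251": 1, "latin-1": 2}


def write_string(value: str) -> bytes:
    """Existing layout (assumed): type code, int32 length, UTF-8 bytes."""
    data = value.encode("utf-8")
    return struct.pack("<bi", STRING, len(data)) + data


def write_string_encoded(value: str, charset: str) -> bytes:
    """Proposed layout (assumed): type code, 1-byte charset id, int32 length, bytes."""
    data = value.encode(charset)
    header = struct.pack("<bbi", STRING_ENCODED, CHARSET_IDS[charset], len(data))
    return header + data


if __name__ == "__main__":
    # Print the serialized size of each sample in both layouts.
    for text in ("hello", "привет"):
        plain = len(write_string(text))
        encoded = len(write_string_encoded(text, "cp1251"))
        print(f"{text}: STRING={plain} bytes, STRING_ENCODED={encoded} bytes")
```

Under these assumptions, ASCII text costs one byte more in the encoded layout ("hello" is 10 bytes against 11), while the six-letter Cyrillic word shrinks from 17 bytes to 12. A one-character Cyrillic string comes out at 7 bytes in both layouts, so the saving starts at two characters.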
Pavel,

This forces the user to implement Binarylizable for the whole type in case they want to change encoding for one or two fields, right? I really don't like it. Why not add a default encoding to BinaryTypeConfiguration?

-Val

On Thu, Jul 27, 2017 at 7:54 AM, Pavel Tupitsyn <[hidden email]> wrote:

> > 1 byte for every field just for this
> GridBinaryMarshaller.STRING data type remains untouched.
> We add GridBinaryMarshaller.STRING_ENCODED, which has additional byte for
> encoding type.
>
> This means no overhead for existing code.
> I think the most common use case is English, which uses 1 byte per char in
> UTF-8.
> This is already as fast and compact as possible, and we don't want to
> introduce any lookup overhead here.
>
> And when user knows that their data will be more compact in some specific
> encoding,
> they use some BinaryWriter.writeString overload, which writes a different
> type code.
>
> Yes, it also writes an extra byte, but you save a byte per char of the
> actual string
> (for example, when using Windows-1251 for Russian text), so this does not
> matter.
Val, of course other options should be available, such as BinaryTypeConfiguration, and maybe field-level and class-level annotations.

On Thu, Jul 27, 2017 at 9:07 PM, Valentin Kulichenko <[hidden email]> wrote:

> Pavel,
>
> This forces user to implement Binarylizable for whole type in case they
> want to change encoding for one-two fields, right? I really don't like it,
> why not add default encoding to BinaryTypeConfiguration?
>
> -Val
As Pavel mentioned, Marshaller should not be tied to cache. BinaryObject should be self-explanatory, i.e. contain all information necessary for unmarshalling. This is an absolute requirement.

We will have one extra byte in serialized form, meaning that advantage of custom encoding will become evident for all strings with length >= 1, which is perfectly fine. I do not quite understand what we are arguing about.

As far as configuration, we can do it as follows:
1) Add global encoding, UTF8 by default.
2) Add per-cache encoding.
3) Add encoding to JDBC and ODBC driver properties.

This should be enough.

On Fri, Jul 28, 2017 at 11:45, Pavel Tupitsyn <[hidden email]> wrote:

> Val, of course other options should be available, such as
> BinaryTypeConfiguration,
> and maybe field-level and class-level annotations.
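The requirement that a serialized string carry everything needed to unmarshal it can be sketched as a reader that dispatches on the type code alone, without consulting any cache or type configuration. This is only an illustration under stated assumptions: the type code values, the charset id table, the little-endian 4-byte length, and the function name `read_string` are hypothetical and do not come from the actual marshaller.

```python
import struct
from typing import Tuple

# Hypothetical type codes and charset table, for illustration only.
STRING = 9
STRING_ENCODED = 100
CHARSETS = {1: "cp1251", 2: "latin-1"}


def read_string(buf: bytes, pos: int = 0) -> Tuple[str, int]:
    """Decode one string value at pos; return the text and the next position."""
    type_code = buf[pos]
    pos += 1
    if type_code == STRING:
        # Plain strings are assumed to be UTF-8 and carry no charset byte.
        charset = "utf-8"
    elif type_code == STRING_ENCODED:
        # Encoded strings carry a one-byte charset id right after the type code.
        charset = CHARSETS[buf[pos]]
        pos += 1
    else:
        raise ValueError(f"not a string type code: {type_code}")
    (length,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    return buf[pos:pos + length].decode(charset), pos + length
```

Because the charset id travels with the value, the same reader would handle compute job arguments, raw-mode data, and cache entries alike. The global, per-cache, and driver-level settings listed above would then only matter on the write side, when choosing which layout to emit.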