Apache Ignite Developers - Legacy Mail Archive

[IMPORTANT] Future of Binary Objects

dmagda

Re: [IMPORTANT] Future of Binary Objects

Vladimir,

Could you educate me a little bit: why is the current format bad for SQL,
and why would another one be more suitable?

Also, if we introduce the new format, then why would we keep the binary one?
Is the new format just the next version of the binary one?

> 2.3) Remove restrictions on changing field type
> I do not know why we did that in the first place. This restriction prevents
> type evolution and confuses users.

That is a hot requirement shared by those who use Ignite SQL in production.
+1.

--
Denis

On Mon, Nov 19, 2018 at 11:05 PM Vladimir Ozerov <[hidden email]>
wrote:

> Igniters,
>
> It is very likely that Apache Ignite 3.0 will be released next year. So we
> need to start thinking about major product improvements. I'd like to start
> with binary objects.
>
> Currently they are one of the main limiting factors for the product. They
> are fat - 30+ bytes of overhead on average, which means a high TCO for
> Apache Ignite compared to other vendors. They are slow - not suitable for
> SQL at all.
>
> I would like to ask all of you who have worked with binary objects to share
> your feedback and ideas, so that we understand what they should look like in
> AI 3.0. This is a brainstorm - let's accumulate ideas first and minimize
> criticism. Then we will work on the ideas in separate topics.
>
> 1) Historical background
>
> Binary objects (BO) were implemented around 2014 (Apache Ignite 1.5) when we
> started working on the .NET and CPP clients. During design we had several
> ideas in mind:
> - ability to read object fields in O(1) without deserialization
> - interoperability between Java, .NET and CPP.
>
> Since then a number of other concepts have been mixed into the cocktail:
> - Affinity key fields
> - Strict typing for existing fields (aka metadata)
> - Binary Object as storage format
>
> 2) My proposals
>
> 2.1) Introduce "Data Row Format" interface
> Binary Objects are terrible candidates for storage. Too fat, too slow.
> Efficient storage typically has <10 bytes of overhead per row (no metadata,
> no length, no hash code, etc.), allows super-fast field access, supports
> different string formats (ASCII, UTF-8, etc.), supports different temporal
> types (date, time, timestamp, timestamp with timezone, etc.), and stores
> these types as efficiently as possible.
>
> What we need is to introduce an interface which will convert a pair of
> key-value objects into a row. This row will be used to store data and to
> get fields from it. If you care about memory consumption and need SQL and a
> strict schema, use one format. If you need flexibility and prefer key-value
> access, use another format which will store binary objects unchanged
> (current behavior).
>
> interface DataRowFormat {
>     // Key and value are primitives or binary objects.
>     DataRow create(Object key, Object value);
>
>     DataRowMetadata metadata();
> }
>
> 2.2) Remove affinity field from metadata
> Affinity rules are governed by cache, not type. We should remove
> "affintiyFieldName" from metadata.
>
> 2.3) Remove restrictions on changing field type
> I do not know why we did that in the first place. This restriction prevents
> type evolution and confuses users.
>
> 2.4) Use bitmaps for "null" and default values and for fixed-length fields,
> put fixed-length fields before variable-length.
> Motivation: to save space.
>
> What else? Please share your ideas.
>
> Vladimir.
>
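
Proposal 2.4 in the quoted message can be illustrated with a small sketch. The class and method names below are hypothetical and not part of Ignite; the layout (a one-byte null bitmap followed by 8 bytes per non-null `long` column) is only one assumed way to realize the idea, limited to 8 columns for brevity.

```java
import java.nio.ByteBuffer;

final class NullBitmapRow {
    /** Encodes nullable long columns as a 1-byte null bitmap plus 8 bytes per non-null value. */
    static byte[] encode(Long[] cols) {
        if (cols.length > 8)
            throw new IllegalArgumentException("this sketch supports at most 8 columns");

        int bitmap = 0;
        int present = 0;
        for (int i = 0; i < cols.length; i++) {
            if (cols[i] == null)
                bitmap |= 1 << i;   // bit i set means column i is null
            else
                present++;
        }

        ByteBuffer buf = ByteBuffer.allocate(1 + 8 * present);
        buf.put((byte) bitmap);
        for (Long c : cols) {
            if (c != null)
                buf.putLong(c);     // null columns occupy no space at all
        }
        return buf.array();
    }

    /** Reads column idx, or returns null when its bitmap bit is set. */
    static Long read(byte[] row, int idx) {
        int bitmap = row[0] & 0xFF;
        if ((bitmap & (1 << idx)) != 0)
            return null;

        // Count the non-null columns stored before idx to locate the value.
        int before = Integer.bitCount(~bitmap & ((1 << idx) - 1));
        return ByteBuffer.wrap(row).getLong(1 + 8 * before);
    }

    public static void main(String[] args) {
        byte[] row = encode(new Long[] {7L, null, 42L});
        System.out.println(row.length + " bytes, column 2 = " + read(row, 2));
    }
}
```

In this sketch a null column costs one bit instead of a full value slot, which is the space saving the proposal is after.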

Valentin Kulichenko

Re: [IMPORTANT] Future of Binary Objects

We should definitely allow changing the type of a field/column to another
compatible type. The fact that we do not allow changing Int to Long is
pretty insane. However, there are cases where it's much more complicated.
How are we going to replace an Int with a String, for example? I believe this
should require a certain migration procedure anyway. How do other databases
handle that?

-Val

On Wed, Nov 21, 2018 at 4:32 PM Denis Magda <[hidden email]> wrote:


Vladimir Ozerov

Re: [IMPORTANT] Future of Binary Objects

In reply to this post by Ilya Kasnacheev

Ilya,

Currently binary objects already work almost as you proposed. We have a
4-byte type ID (a hash of the type name) and a 4-byte schema ID (a hash of
all field names). We do not write field IDs in the object itself. What we do
not have is separation of fixed and varlen fields. I agree that we should
implement it and remove the offsets of fixed fields from the binary object.
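
The separation of fixed and varlen fields can be sketched as follows. This is an illustration only: `FixedFirstLayout`, `Kind` and the 8-byte header (4-byte type ID plus 4-byte schema ID, as described above) are assumptions made for the example, not Ignite classes.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedFirstLayout {
    /** Hypothetical field kinds; a negative size marks a variable-length field. */
    enum Kind {
        INT(4), LONG(8), STRING(-1);

        final int size;

        Kind(int size) { this.size = size; }

        boolean fixed() { return size >= 0; }
    }

    /** Assumed header: 4-byte type ID followed by a 4-byte schema ID. */
    static final int HEADER_SIZE = 8;

    /** Reorders the declared fields so that fixed-length ones come first. */
    static List<Kind> layout(List<Kind> declared) {
        List<Kind> ordered = new ArrayList<>();
        for (Kind k : declared) if (k.fixed()) ordered.add(k);
        for (Kind k : declared) if (!k.fixed()) ordered.add(k);
        return ordered;
    }

    /** Byte offset of the fixed-length field at position idx, computed from the schema alone. */
    static int offsetOf(List<Kind> ordered, int idx) {
        if (!ordered.get(idx).fixed())
            throw new IllegalArgumentException("field " + idx + " is variable-length");

        int off = HEADER_SIZE;
        for (int i = 0; i < idx; i++)
            off += ordered.get(i).size;   // all earlier fields are fixed-length
        return off;
    }

    public static void main(String[] args) {
        List<Kind> ordered = layout(List.of(Kind.STRING, Kind.LONG, Kind.INT));
        System.out.println(ordered + " -> INT at byte " + offsetOf(ordered, 1));
    }
}
```

Because the offsets follow from the schema, the object itself would not need to store them for fixed-length fields; only variable-length fields still need per-object offsets.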

On Wed, Nov 21, 2018 at 7:18 PM Ilya Kasnacheev <[hidden email]>
wrote:

> Hello!
>
> I would like to propose the following changes:
>
> - Let's allow multiple BinaryTypes per Class. Make typeId = cksum(list of
> class types + fields) as opposed to cksum(class name) as we have it
> currently. Note that we only have to compute that once per class loaded in
> the JVM.
> - BinaryType has a list of fixed-length fields (numbers, datetimes, flags)
> and a list of variable-length fields. We can put all fixed-length fields at
> the start of BinaryObject so that we can access them by offset as per typeId.
> - Likewise, we don't need to encode the field id in BinaryObject anymore,
> saving 4 bytes per field. We already know their order from BinaryType.
> - This means when you ALTER TABLE we add a BinaryType to the existing Class
> (or pseudo-Class type name) and we can use it for new data, and eventually
> update existing data to have this field.
> - On top of BinaryTypes we can have checks that run them against the SQL
> table columns list to see if there are any mismatches.
>
> To illustrate, previously we had it like:
> [ Type id | String field id | String field value | Long field id | Long field value | Datetime field id | Datetime field value ]
> But now it will be:
> [ Type id | Long field value | Datetime field value | String field value ]
>             ^------------------^---- can be accessed by offset
>
> Regards,
> Ilya.
>
> --
> Ilya Kasnacheev
>
>
> On Tue, Nov 20, 2018 at 10:05, Vladimir Ozerov <[hidden email]> wrote:
>

Vladimir Ozerov

Re: [IMPORTANT] Future of Binary Objects

In reply to this post by dmagda

Denis,

Several examples:
1) DEFAULT values - in SQL you may avoid storing a default value in the table
and store it in metadata instead. This is not applicable to BinaryObject
because the same binary object may be saved to two SQL tables with different
defaults.
2) DATE and other temporal types - in SQL you want to store them in a special
format to be able to extract date parts quickly (typically 11 bytes). But
in Java and some other languages the best format is a plain long. This is why
we use it in BinaryObject.
3) String charset - in SQL you may choose different charsets for different
tables, e.g. UTF-8 for one, ASCII for another. In BinaryObject we store
everything in UTF-8, and this is fine for most cases, well ... except for
SQL :-)

The key thing here is that you cannot define a format which will be good
for both SQL and the native API. They are very different. This is why I
propose to define an additional interface at the cache level defining how to
store values, which will be very different from binary objects.

Vladimir.

On Thu, Nov 22, 2018 at 3:32 AM Denis Magda <[hidden email]> wrote:

Vladimir Ozerov

Re: [IMPORTANT] Future of Binary Objects

In reply to this post by Valentin Kulichenko

Val,

If we treat a binary object as a plain container of fields with certain names
and types, we do not care how to convert Int to String. It is up to the user
to decide how to migrate.
Ignite could help users in some cases. E.g. for SQL caches we may provide an
ALTER TABLE command, which will do the necessary conversions at the storage
layer.

On Thu, Nov 22, 2018 at 4:27 AM Valentin Kulichenko <
[hidden email]> wrote:


Sergi

Re: [IMPORTANT] Future of Binary Objects

In reply to this post by Vladimir Ozerov

It may be OK to extend compatible field types (like from Int to Long).

In Protobuf, for example, this is allowed just because there is no difference
between Int and Long in the binary format: both are varlen encoded the same
way, and Longs will just occupy up to 9 bytes, while Ints up to 5 (for
non-negative values).

But for every other case, where the binary representation is type dependent, I
would be against. This will either require migrating the whole dataset to
a new model (which is always risky, since you may need to roll back to the
previous version of your code) or it will require type checks/conversions
on each field access, which is a complication that is hard to reason about
and a possible performance penalty.

Sergi
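
The varlen argument above can be made concrete with a minimal base-128 varint sketch. `VarintDemo` is a hypothetical name, and the code is a simplified illustration of the encoding style rather than the Protobuf library itself. It treats the value as an unsigned 64-bit quantity, so a non-negative long needs at most 9 bytes, while a negative one (all 64 bits in use) needs 10.

```java
import java.io.ByteArrayOutputStream;

final class VarintDemo {
    /** Encodes v as a base-128 varint: 7 payload bits per byte, high bit set on all but the last byte. */
    static byte[] encode(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;   // unsigned shift, so negative values terminate too
        }
        out.write((int) v);
        return out.toByteArray();
    }

    /** Decodes a varint produced by encode(). */
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        // The same bytes decode correctly whether the writer meant an Int or a Long.
        byte[] asInt = encode(Integer.MAX_VALUE);
        System.out.println(asInt.length + " bytes, decoded as long: " + decode(asInt));
        System.out.println(encode(Long.MAX_VALUE).length + " bytes for Long.MAX_VALUE");
    }
}
```

Since the reader never needs to know which width the writer had in mind, widening Int to Long requires no data migration under this kind of encoding.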

On Thu, Nov 22, 2018 at 09:23, Vladimir Ozerov <[hidden email]> wrote:


Vladimir Ozerov

Re: [IMPORTANT] Future of Binary Objects

Sergi,

I think we should not guess for users what is right or wrong for them. It
is up to the user to decide what is valid. For example, consider a user who
operates on a list of Integers and, to optimize memory consumption, decides
to save in the same field either a List<Integer> or a plain Integer in
case only a single element exists. Another example is a kind of data lake or
data cleansing application, which may receive the same field in different
forms, e.g. age in the form of an Integer or a String. Does it work for the
user or not? We do not know. Will they need to migrate the whole data set? We
do not know either.

The only place in the product where we care is SQL. But in this case,
instead of adding checks at the binary level, we should validate data at the
cache level. In fact, Ignite already works this way. E.g. nullability checks
are performed at the cache level rather than the binary level. All we need is
to move all checks from the binary level to the cache level.
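
As a rough sketch of what cache-level validation could look like: the binary layer accepts any field types, and a per-cache check rejects values that do not match that cache's SQL columns. Everything here (`CacheLevelValidation`, the map-based schema) is hypothetical and only illustrates the idea; it is not how Ignite implements its checks.

```java
import java.util.Map;

final class CacheLevelValidation {
    /**
     * Validates field values against the column types of one cache.
     * Fields without a matching column and null values are not checked here.
     */
    static void validate(Map<String, Class<?>> columns, Map<String, Object> fields) {
        for (Map.Entry<String, Class<?>> col : columns.entrySet()) {
            Object value = fields.get(col.getKey());
            if (value != null && !col.getValue().isInstance(value)) {
                throw new IllegalArgumentException("Column " + col.getKey() + " expects "
                    + col.getValue().getSimpleName() + " but got "
                    + value.getClass().getSimpleName());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Class<?>> sqlColumns = Map.of("age", Integer.class);

        validate(sqlColumns, Map.of("age", 30));          // accepted
        try {
            validate(sqlColumns, Map.of("age", "30"));    // rejected by this cache only
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A key-value cache with no SQL schema would simply skip this step, so the same object with `age` stored as a String would remain valid there.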

On Thu, Nov 22, 2018 at 9:41 AM Sergi Vladykin <[hidden email]>
wrote:


Sergi

Re: [IMPORTANT] Future of Binary Objects

If we are developing a product for users, we are already guessing what is
right and what is wrong for them. So let's avoid these sophistic statements.

In the end it is always our responsibility to provide a balanced set of
trade-offs between usability, performance and safety.

Let me repeat: I'm not against type conversions in general, but I'm
strongly against binary incompatible ones.
If we always store List.of(1) as 1 and make them binary interchangeable,
I'm OK with that.

And still, for good practices I'd suggest looking at what Protobuf allows
and what it does not:
https://developers.google.com/protocol-buffers/docs/proto3#updating

Sergi

On Thu, Nov 22, 2018 at 11:04, Vladimir Ozerov <[hidden email]> wrote:

> Sergi,
>
> I think we should not guess for users what is right or wrong for them. It
> is up to the user to decide what is valid. For example, consider a user who
> operates on a list of Integers and, to optimize memory consumption,
> decides to save in the same field either a List<Integer> or a plain Integer
> when only a single element exists. Another example is a kind of data lake or
> data cleansing application, which may receive the same field in different
> forms, e.g. age in the form of Integer or String. Does it work for the user
> or not? We do not know. Will he need to migrate the whole data set? We do
> not know either.
>
> The only place in the product where we care is SQL. But in this case,
> instead of adding checks on the binary level, we should validate data on the
> cache level. In fact, Ignite already works this way: e.g. nullability checks
> are performed on the cache level rather than the binary level. All we need
> is to move all checks from the binary level to the cache level.
>
>
> On Thu, Nov 22, 2018 at 9:41 AM Sergi Vladykin <[hidden email]>
> wrote:
>
> > It may be OK to extend compatible field types (like from Int to Long).
> >
> > In Protobuf, for example, this is allowed just because there is no
> > difference between Int and Long in the binary format: both are varlen
> > encoded, and Longs will just occupy up to 9 bytes, while Ints up to 5.
> >
> > But for every other case, where the binary representation is type
> > dependent, I would be against. This will either require migrating the
> > whole dataset to a new model (which is always risky, since you may need to
> > roll back to the previous version of your code), or it will require type
> > checks/conversions for each field access, which is a complication that is
> > hard to reason about and a possible performance penalty.
> >
> > Sergi
> >
> >
> >
> > On Thu, Nov 22, 2018 at 09:23, Vladimir Ozerov <[hidden email]> wrote:
> >
> > > Denis,
> > >
> > > Several examples:
> > > 1) DEFAULT values - in SQL you may avoid storing the default value in
> > > the table and store it in metadata instead. Not applicable for
> > > BinaryObject because the same binary object may be saved to two SQL
> > > tables with different defaults.
> > > 2) DATE and other temporal types - in SQL you want to store them in a
> > > special format to be able to extract date parts quickly (typically - 11
> > > bytes). But in Java and some other languages the best format is a plain
> > > long. This is why we use it in BinaryObject.
> > > 3) String charset - in SQL you may choose different charsets for
> > > different tables, e.g. UTF-8 for one, ASCII for another. In BinaryObject
> > > we store everything in UTF-8, and this is fine for most cases, well ...
> > > except for SQL :-)
> > >
> > > The key thing here is that you cannot define a format which will be
> > > good for both SQL and the native API. They are very different. This is
> > > why I propose to define an additional interface on the cache level
> > > defining how to store values, which will be very different from binary
> > > objects.
> > >
> > > Vladimir.

Vladimir Ozerov

Re: [IMPORTANT] Future of Binary Objects

OK, let's agree that we would like to make schema change rules less
restrictive. How much less restrictive is a separate topic. The use case which
annoys me the most is the DROP/ADD COLUMN commands.
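
One way to picture how strict field typing can get in the way of DROP/ADD
COLUMN is the toy registry below. It assumes, purely for illustration, that
the field type recorded in metadata outlives the dropped column. The class is
hypothetical and does not reproduce Ignite's actual metadata code.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy registry with strict field typing (hypothetical, for illustration only). */
public class StrictMetadataSketch {
    private final Map<String, Class<?>> fieldTypes = new HashMap<>();

    /** Records the field type; returns false if it conflicts with an earlier one. */
    public boolean register(String field, Class<?> type) {
        Class<?> prev = fieldTypes.putIfAbsent(field, type);
        return prev == null || prev == type;
    }
}
```

With this registry, registering "age" as Integer succeeds. If the column is
then dropped without clearing the entry, re-adding "age" as String is
rejected, because the old type is still recorded.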

On Thu, Nov 22, 2018 at 12:25 PM Sergi Vladykin <[hidden email]>
wrote:

> If we are developing a product for users, we are already guessing what is
> right and what is wrong for them. So let's avoid these sophistic statements.
>
> In the end it is always our responsibility to provide a balanced set of
> trade-offs between usability, performance and safety.
>
> Let me repeat: I'm not against type conversions as such, but I'm
> strongly against binary-incompatible ones.
> If we always store List.of(1) as 1 and make them binary interchangeable,
> I'm OK with that.
>
> And still, for good practices I'd suggest looking at what Protobuf allows
> and what it does not:
> https://developers.google.com/protocol-buffers/docs/proto3#updating
>
> Sergi