Distributed MetaStorage discussion

Distributed MetaStorage discussion

Ivan Bessonov
Hello, Igniters!

Here's more info on the "Distributed MetaStorage" feature [1]. It is a part of
Phase II of IEP-4 (Baseline topology) [2] and was mentioned in the recent
"Baseline auto-adjust" discussion topic. I'll partially duplicate that message
here.

One of the key requirements is the ability to store configuration data (or any
other data) consistently and cluster-wide. There are also other tickets that
require similar mechanisms, for example [3]. Ignite doesn't have any specific
API for such configurations, and we don't want many similar implementations of
the same feature scattered across the code.

There are several API methods required for the feature (a usage sketch follows
the list):

 - read(key) / iterate(keyPrefix) - access to the distributed data. Should be
   consistent for all nodes in the cluster while it is in the active state.
 - write / remove - modify data in the distributed metastorage. Should
   guarantee that every node in the cluster has the update once the method
   completes.
 - writeAsync / removeAsync (not yet implemented) - same as above, but async.
   Might be useful if one needs to update several values one after another.
 - compareAndWrite / compareAndRemove - helpful to reduce the number of data
   updates (more on that later).
 - listen(keyPredicate) - a way of being notified when some data is changed.
   Normally it is triggered on a "write/remove" operation or on node
   activation. The listener itself is notified with <key, oldValue, newValue>.
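
To make the shape of the API concrete, here is a minimal usage sketch. The
interface and method signatures below are illustrative assumptions, not the
final API - see the package mentioned at the end of this mail for the real
interfaces.

import java.io.Serializable;
import java.util.function.Predicate;

// Hypothetical interfaces, for illustration only.
interface DistributedMetaStorageListener {
    void onUpdate(String key, Serializable oldVal, Serializable newVal);
}

interface DistributedMetaStorage {
    Serializable read(String key);
    void write(String key, Serializable val);
    void remove(String key);
    boolean compareAndWrite(String key, Serializable expVal, Serializable newVal);
    void listen(Predicate<String> keyPred, DistributedMetaStorageListener lsnr);
}

class UsageSketch {
    static void example(DistributedMetaStorage metastorage) {
        // Cluster-wide write: returns once every alive node has applied it.
        metastorage.write("sql.schemaVer", 1L);

        // Consistent read on any node of an active cluster.
        Long ver = (Long)metastorage.read("sql.schemaVer");

        // Get notified of future updates (also replayed on node activation).
        metastorage.listen(
            key -> key.startsWith("sql."),
            (key, oldVal, newVal) ->
                System.out.println(key + ": " + oldVal + " -> " + newVal));
    }
}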

Now some implementation details:

The first implementation is based on the existing local metastorage API for
persistent clusters (in-memory clusters store the data in memory). Write/remove
operations use the Discovery SPI to send updates to the cluster; it guarantees
the order of updates and the fact that all existing (alive) nodes have handled
the update message.

To find out which node has the latest data, there is a "version" value of the
distributed metastorage, which is basically <number of all updates, hash of all
updates>. The whole updates history up to some point in the past is stored
along with the data, so when an outdated node connects to the cluster it
receives all the missing updates and applies them locally. Listeners are also
invoked after such updates. If there's not enough history stored, or the
joining node is clean, then it receives a snapshot of the distributed
metastorage instead, so there won't be inconsistencies. The "compareAndWrite" /
"compareAndRemove" API might help reduce the size of the history, especially
for Boolean or other primitive values, as sketched below.
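
The history-size point is that an unconditional write appends a new history
entry even when the value doesn't actually change, while a conditional update
can skip the no-op entirely. A hedged sketch, reusing the hypothetical
DistributedMetaStorage interface from the sketch above (the key name is made
up):

class FlagSketch {
    static void setFlag(DistributedMetaStorage metastorage, boolean desired) {
        Boolean cur = (Boolean)metastorage.read("feature.enabled");

        if (cur != null && cur == desired)
            return; // No-op: no cluster-wide update, no history growth.

        // Appends to history only if the value is still 'cur'; a concurrent
        // writer makes this fail instead of producing a redundant entry.
        metastorage.compareAndWrite("feature.enabled", cur, desired);
    }
}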

There are, of course, many more details; feel free to ask about them. The
first implementation is in master, but there are already known improvements
that can be made, and I'm working on them right now.

See the package "org.apache.ignite.internal.processors.metastorage" for the
new interfaces and share your opinions or questions. Thank you!

[1] https://issues.apache.org/jira/browse/IGNITE-10640
[2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-4+Baseline+topology+for+caches
[3] https://issues.apache.org/jira/browse/IGNITE-8717

--
Sincerely yours,
Ivan Bessonov

Re: Distributed MetaStorage discussion

Vladimir Ozerov
Ivan,

The change you describe is an extremely valuable thing, as it allows us to
detect changes to the global configuration, which is of great importance for
SQL. Will topology and affinity changes be reflected in the metastore history
as well? From the SQL perspective it is important for us to be able to
understand whether the cluster topology, the data distribution, or the SQL
schema has changed between two versions. Is it possible to have a kind of
composite version instead of a hashed counter? E.g.:

class ConfigurationVersion {
    long globalVer;    // Global counter.
    long topVer;       // Increasing topology version.
    long affVer;       // Increasing affinity version, incremented every time
                       // the data distribution changes (node join/leave,
                       // baseline changes, late affinity assignment).
    long sqlSchemaVer; // Incremented every time the SQL schema changes.
}
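
The point of such a composite version is that a consumer can compare only the
component it cares about. A hedged sketch of the intended use (the helper name
is made up):

class VersionCompareSketch {
    // Re-plan SQL only when the schema component changed, ignoring unrelated
    // updates that bump only the global counter.
    static boolean sqlSchemaChanged(ConfigurationVersion a, ConfigurationVersion b) {
        return a.sqlSchemaVer != b.sqlSchemaVer;
    }
}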

Vladimir.


Re: Distributed MetaStorage discussion

Ivan Bessonov
Vladimir,

thank you for the reply. Topology and affinity changes are not reflected in
the distributed metastorage; we didn't touch the baseline history at all. I
believe that what you really need is just a distributed property "sqlSchemaVer"
that is updated on each schema update. It could be achieved by creating a
corresponding key in the distributed metastorage, without any special treatment
from the API standpoint, as sketched below.
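
A hedged sketch of that idea, again using the hypothetical
DistributedMetaStorage interface from the first mail (the key name and retry
loop are illustrative):

class SchemaVerSketch {
    // Bump "sqlSchemaVer" on every schema change. compareAndWrite makes
    // concurrent bumps safe; retry when we lose the race.
    static long bumpSqlSchemaVer(DistributedMetaStorage metastorage) {
        while (true) {
            Long cur = (Long)metastorage.read("sqlSchemaVer");
            long next = (cur == null ? 0L : cur) + 1;

            if (metastorage.compareAndWrite("sqlSchemaVer", cur, next))
                return next; // Every alive node will observe 'next'.
        }
    }
}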

The same thing applies to topology and affinity versions, but the motivation
here is not that clear to me, to be honest.

I think that the most common approach, with a single incrementing version, is
much simpler than several counters, and I would prefer to leave it that way.


--
Sincerely yours,
Ivan Bessonov

Re: Distributed MetaStorage discussion

Vladimir Ozerov
Ivan,

The idea is that certain changes to the system are not relevant for all
components. E.g. if the SQL schema is changed, then some SQL caches need to be
invalidated; when the affinity topology changes, another part of the caches
needs to be invalidated. Having a single version may lead to unexpected latency
spikes and invalidations in this case.
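
In other words, with separate version keys each component reacts only to its
own counter, while a single global counter would make every listener fire (and
potentially invalidate) on every unrelated update. A hedged sketch, once more
with the hypothetical interfaces from the first mail (key names are made up):

class InvalidationSketch {
    static void subscribe(DistributedMetaStorage metastorage) {
        // SQL caches react only to schema bumps...
        metastorage.listen(
            key -> key.equals("ver.sqlSchema"),
            (key, oldVal, newVal) -> invalidateSqlPlans());

        // ...while affinity-dependent caches react only to affinity bumps.
        metastorage.listen(
            key -> key.equals("ver.affinity"),
            (key, oldVal, newVal) -> invalidatePartitionMaps());
    }

    static void invalidateSqlPlans() { /* drop cached query plans */ }

    static void invalidatePartitionMaps() { /* drop cached partition maps */ }
}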
